Abstract


In this paper we construct an analytic model of cache misses during matrix multiplication. The
analysis in this paper applies to square matrices of size $2^m$ where the array layout function is given in
terms of a function $\Theta$ that interleaves the bits in the binary expansions of the row and column indices.
We first analyze the number of cache misses for direct-mapped caches and then indicate how to extend
this analysis to $A$-way associative caches.

The  work  in  this  paper  accomplishes  two  things.  First,  we  construct  fast  algorithms  to  estimate  the 
number  of  cache  misses.  Second,  we  develop  theoretical  understanding  of  cache  misses  that  will  allow 
us,  in  subsequent  work,  to  approach  the  problem  of  minimizing  cache  misses  by  appropriately  choosing 
the  bit  interleaving  function  that  goes  into  the  array  layout  function. 


1  Introduction 

As  the  gap  between  processor  cycle  time  and  main  memory  access  time  continues  to  widen,  effective  use 
of  the  memory  hierarchy  becomes  ever  more  critical  to  overall  program  performance.  Caches  can  help 
alleviate  the  CPU-memory  gap  by  satisfying  most  memory  references  at  close  to  processor  speed  (1  to  3 
cycles).  Unfortunately,  programs  that  do  not  exhibit  good  memory  reference  locality  cannot  exploit  the 
potential  benefits  of  caches. 

For  scientific  computations  that  repeatedly  access  large  data  sets,  good  locality  of  reference  is  essential 
at the algorithm level for high performance. Such locality can either be temporal, in which a single data
item is reused repeatedly, or spatial, in which a group of data items “adjacent” in space are used in temporal
proximity. High-performance dense linear algebra codes rely on good spatial and temporal locality of
reference for their performance. In this paper, we focus on an analysis of matrix multiplication, the workhorse
of  modern  linear  algebraic  algorithms. 

Our  previous  studies  demonstrated  an  intimate  relationship  between  the  layout  of  the  arrays  in  memory 
and the performance of the routine [1, 2]. This early work experimentally showed the benefits of using
array  layout  functions  based  on  interleaving  the  bits  in  the  binary  expansions  of  the  row  and  column  indices 
of  arrays.  This  paper  complements  our  earlier  empirical  studies  by  providing  an  analytical  framework  for 
analyzing  the  cache  behavior  of  matrix  multiplication  in  the  presence  of  such  array  layout  functions.  Future 
work  will  use  this  framework  in  an  optimization  context,  to  determine  array  layouts  that  minimize  the 
number  of  cache  misses. 

The  remainder  of  this  section  provides  the  background  of  the  cache  analysis  problem.  Section  1.1 
provides  a  brief  overview  of  cache  memory  basics.  Section  1.2  describes  our  analysis  framework — both  the 
similarities  to  earlier  work  and  the  critical  differences  that  require  us  to  use  completely  different  techniques. 
Section  1.3  discusses  array  layout  functions  based  on  bit  interleaving.  Section  1.4  reiterates  the  goals  of  our 
analysis  and  provides  a  roadmap  of  the  remainder  of  the  paper. 

1.1  Basics  of  cache  memory 

We  assume  a  simplified  memory  hierarchy  that  processes  one  memory  access  at  a  time,  with  no  distinction 
between  memory  reads  and  writes. 

The  structure  of  a  single  level  of  a  memory  hierarchy — called  a  cache — is  generally  characterized  by 
three  parameters:  Associativity,  Block  size,  and  Capacity.  Capacity  and  block  size  are  in  units  of  the 
minimum  memory  access  size  (usually  one  byte).  A  cache  can  hold  a  maximum  of  C  bytes.  However,  due 
to  physical  constraints,  the  cache  is  divided  into  cache  frames  of  size  B  that  contain  B  contiguous  bytes 
of memory, called a memory block. The associativity $A$ specifies the number of different frames in which
a memory block can reside. If a block can reside in any frame (i.e., $A = C/B$), the cache is said to be fully
associative; if $A = 1$, the cache is direct-mapped; otherwise, the cache is $A$-way set associative.

For a given memory access, the hardware inspects the cache to determine if the corresponding memory
element  is  resident  in  the  cache.  This  is  accomplished  by  using  an  indexing  function  to  locate  the  appropriate 
set  of  cache  frames  that  may  contain  the  memory  block.  If  the  memory  block  is  resident,  a  cache  hit  is  said 
to  occur,  and  the  cache  satisfies  the  access  after  its  access  latency.  If  the  memory  block  is  not  resident,  a 
cache  miss  is  said  to  occur. 

From an architectural standpoint, cache misses fall into one of three classes [7].

•  A  compulsory  miss  is  one  that  is  caused  by  referencing  a  previously  unreferenced  memory  block. 
Eliminating  a  compulsory  miss  requires  prefetching  the  data,  either  by  an  explicit  prefetch  operation 
or  by  placing  more  data  items  in  a  single  memory  block. 

• A reference that is not a compulsory miss but misses in a fully-associative cache with LRU replacement
is classified as a capacity miss. Capacity misses are caused by referencing more memory blocks
than can fit in the cache. Restructuring the program to re-use blocks while they are in cache can reduce
capacity misses.

• A reference that hits in a fully-associative cache but misses in an $A$-way set-associative cache is
classified as a conflict miss (or interference miss). A conflict miss to block $X$ indicates that block $X$
has been referenced in the recent past, since it is contained in the fully-associative cache, but at least
$A$ other memory blocks that map to the same cache set have been accessed since the last reference to
block $X$. Eliminating conflict misses requires transforming the program to change either the memory
allocation or layout of the arrays (so that contemporaneous accesses do not compete for the
same sets) or the manner in which the arrays are accessed.

At  the  program  source  level,  interference  misses  can  be  further  subdivided  based  on  whether  the 
interfering  blocks  come  from  different  parts  of  a  single  array,  or  from  different  arrays.  The  miss  is 
called a self-interference miss in the former case and a cross-interference miss in the latter case [8].

1.2  An  analysis  framework 

Our  general  model  for  counting  cache  misses  follows  the  framework  used  in  previous  work  [5],  with  one 
significant  difference.  We  first  explain  the  common  framework,  then  highlight  the  key  difference  in  our 
version  of  the  problem  that  necessitates  entirely  new  solution  techniques. 

The program fragment whose cache behavior we are trying to analyze is a perfectly nested normalized
loop with $d$ levels of nesting, numbered 1 through $d$ from outermost to innermost. The lower and upper
bounds of $\ell_j$, the loop control variable (LCV) for loop $j$, are affine functions of the LCVs $\ell_1$ through $\ell_{j-1}$.
The iteration space $\mathcal{I}$ is the set of all valid combinations of LCV values that are within the bounds of the
loop nest. The notation $\ell = [\ell_1, \ldots, \ell_d]^T$ denotes a generic point in the iteration space $\mathcal{I}$. The iteration
space is also equipped with a total order $\prec$, which is the lexicographic ordering on $\ell$. The order specifies the
temporal order in which the iteration points in the iteration space are executed.

The loop accesses elements of arrays $A^{(1)}$ through $A^{(m)}$. Array variable $A^{(i)}$ has $d_i$ dimensions, with
$n_j$ being the extent of the array in the $j$th dimension. The data index space $\mathcal{D}_i$ corresponding to array $A^{(i)}$
is the Cartesian product $[0, n_1 - 1] \times \cdots \times [0, n_{d_i} - 1]$.

The statements in the loop body make $k$ references to array variables. We denote these references $R_1$
through $R_k$. A reference $R_i$ has two components: $N_i$, the name of the array referenced (so that $N_i = A^{(j)}$
for some $1 \le j \le m$); and $F_i$, the index expression of the reference, which identifies the coordinates of the array
element accessed by this reference at iteration point $\ell$. The index expression $F_i$ is constrained to be an affine
function of $\ell$ in each of its components. Thus, $F_i$ is a function from the iteration space $\mathcal{I}$ to the data index
space $\mathcal{D}_{N_i}$. We also assume that $R_i$ is the $i$th array reference made at iteration point $\ell$.

Array $A^{(j)}$ has an associated layout function $\mathcal{L}_j$, which is a 1-1 map from $\mathcal{D}_j$ to the memory address space
$\mathbb{Z}^+$. Applying this map to an element of an array produces the byte address of that array element.

We assume a two-level memory hierarchy, with a direct-mapped cache with block size of $B$ bytes and
total capacity of $C$ bytes (and therefore $\rho = C/B$ sets). The quantities $B$ and $C$ are always powers of two
for technological reasons, so we will assume that $\rho = 2^p$. We also assume that main memory is large enough
to hold all the data referenced by the program. The function $\mathcal{B}$ converts a memory byte address into a
memory block address (with $\mathcal{B}(a) = \lfloor a/B \rfloor$). The function $\mathcal{S}$ converts a memory block address to the cache
set to which it maps (thus, $\mathcal{S}(b) = b \bmod \rho$).
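As a minimal sketch of the two address-mapping functions just described, the following C fragment computes the block address and the cache set of a byte address in a direct-mapped cache. The type and function names (`cache_config`, `block_of`, `num_sets`, `set_of`) are illustrative assumptions and do not come from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two cache parameters used in the text. */
typedef struct {
    uint64_t block_bytes;     /* B, a power of two */
    uint64_t capacity_bytes;  /* C, a power of two */
} cache_config;

/* Block address of byte address a: floor(a / B). */
static uint64_t block_of(const cache_config *cfg, uint64_t a) {
    return a / cfg->block_bytes;
}

/* Number of sets of a direct-mapped cache: C / B. */
static uint64_t num_sets(const cache_config *cfg) {
    assert(cfg->block_bytes > 0 && cfg->capacity_bytes % cfg->block_bytes == 0);
    return cfg->capacity_bytes / cfg->block_bytes;
}

/* Cache set to which block address b maps: b mod (C / B). */
static uint64_t set_of(const cache_config *cfg, uint64_t b) {
    return b % num_sets(cfg);
}
```

With B = 32 and C = 1024, byte addresses 100 and 1124 differ by exactly one cache capacity, so they fall in different blocks but in the same set, which is the situation that produces conflict misses.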

Putting all of this notation together, we have the following table of objects of interest and their mathematical
representations.

Object                                              Mathematical representation
An iteration point                                  $\ell$
The $i$th array reference at that iteration         $R_i = (A^{(j)}, F_i)$
The array element accessed by $R_i$ at $\ell$       $e_i = A^{(j)}[F_i(\ell)]$
The byte address of $e_i$                           $m_i = \mathcal{L}_j(F_i(\ell))$
The block address of $m_i$                          $b_i = \mathcal{B}(\mathcal{L}_j(F_i(\ell)))$
The cache set to which $b_i$ maps                   $s_i = \mathcal{S}(\mathcal{B}(\mathcal{L}_j(F_i(\ell))))$

Example  1  Consider  the  following  loop  nest  for  matrix  multiplication  (the  so-called  ikj  variant),  which 
will  be  the  specific  computation  whose  cache  behavior  we  analyze  in  the  remainder  of  this  paper. 


    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            for (j = 0; j < n; j++)
                C[i][j] = C[i][j] + A[i][k] * B[k][j];


This loop nest has depth $d = 3$. The LCVs are $\ell_1 = i$, $\ell_2 = k$, and $\ell_3 = j$. The loop nest accesses
three arrays: $A^{(1)} = A$, $A^{(2)} = B$, and $A^{(3)} = C$. Each array is two-dimensional, so that $\mathcal{D}_1 = \mathcal{D}_2 =
\mathcal{D}_3 = [0, n - 1] \times [0, n - 1]$. There are four array references: $R_1 = A[i][k]$, $R_2 = B[k][j]$, $R_3 = C[i][j]$
(the read access), and $R_4 = C[i][j]$ (the write access). The index expressions of the four references are

$$F_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \ell, \quad F_2 = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \ell, \quad \text{and} \quad F_3 = F_4 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \ell.$$

We defer the discussion of the layout functions of the three arrays to later in this section.


In  the  remainder  of  the  paper  we  will  work  in  units  of  array  elements  rather  than  bytes.  Given  that 
32  bytes  is  a  popular  block  size  for  first-level  caches  in  many  modern  machines,  and  that  double-precision 
numbers are represented with eight bytes, we will assume in this paper that memory blocks and cache blocks
hold  four  array  elements. 

The goal of cache analysis is to efficiently estimate the number of capacity and conflict misses of a given
code fragment, given the numerical value of the loop bounds, a cache configuration, and the layout functions
of the arrays. To formulate the conditions under which the reference $R_i = (A, F_i)$ misses at iteration point
$w$ because it was replaced by reference $R_j = (B, F_j)$, let $u = \mathrm{Last}(w)$ be the most recent iteration point
that accessed $b_A = \mathcal{B}(\mathcal{L}_A(F_i(w)))$, the block being accessed by reference $R_i$ at iteration point $w$. Let $v$
be an iteration point satisfying $u \prec v \prec w$ at which the memory block $b_B = \mathcal{B}(\mathcal{L}_B(F_j(v)))$ accessed by
reference $R_j$ displaced block $b_A$ from cache. This condition is satisfied iff

$$\mathcal{S}(\mathcal{B}(\mathcal{L}_A(F_i(w)))) = \mathcal{S}(\mathcal{B}(\mathcal{L}_B(F_j(v)))). \tag{1}$$

Equation (1) captures both capacity and conflict misses, but does not distinguish between the two. (Discriminating
between these miss classes would require the additional ability to ascertain the hit/miss status
of the reference in a fully-associative cache.) It does not capture compulsory misses, as such misses correspond
to iterations $w$ for which $\mathrm{Last}(w)$ is not defined. We use the term replacement misses to encompass
capacity and conflict misses. We omit compulsory misses from the scope of this paper for two reasons: they
are unavoidable misses that cannot be reduced by optimization techniques, and they need to be formulated
completely differently.

It is clear that a simple strategy to count misses is through simulation of the code.
This is exactly what cache simulators do. The main drawback of simulation is its slowness: it takes time
proportional to the actual execution of the code, usually with a significant multiplicative factor (10 to 100
is typical). In the matrix multiplication example of Example 1, this time is $\Theta(n^3)$. Our interest is in much
faster algorithms, whose existence is suggested by the regularity of the array access patterns and the limited
number of cache sets to which they map. By using these regularities to tame the potential combinatorial
explosion of cases, we will in fact demonstrate algorithms that accurately compute the number of cache
misses for the matrix multiplication example in $O(\max(\log n, \log(C/B)))$ time.
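The simulation strategy mentioned above can be sketched as follows for the ikj loop nest and a direct-mapped cache. This is only an illustration under stated assumptions: the three arrays are stored back to back in row-major order starting at element address 0, addresses are counted in array elements with four elements per block, and all names (`sim_cache`, `sim_access`, `simulate_ikj`) are hypothetical. Its running time grows as $n^3$, which is the cost the analytic approach is meant to avoid.

```c
#include <assert.h>
#include <stdlib.h>

#define ELEMS_PER_BLOCK 4  /* four array elements per block, as assumed in the text */

typedef struct {
    long *tag;         /* block held by each cache set, -1 when empty */
    char *seen;        /* seen[b] is nonzero once block b has been referenced */
    long sets;         /* number of cache sets (direct-mapped: one frame each) */
    long compulsory;   /* first-ever references to a block */
    long replacement;  /* capacity plus conflict misses */
} sim_cache;

/* Records one access to the element stored at the given element address. */
static void sim_access(sim_cache *c, long elem_addr) {
    long block = elem_addr / ELEMS_PER_BLOCK;
    long set = block % c->sets;
    if (c->tag[set] == block) return;            /* hit */
    if (c->seen[block]) c->replacement++;        /* block was evicted earlier */
    else { c->compulsory++; c->seen[block] = 1; }
    c->tag[set] = block;
}

/* Simulates the ikj loop nest on n x n arrays A, B, C laid out back to back
   in row-major order (an assumption of this sketch).  Returns the number of
   replacement misses, or -1 if memory allocation fails. */
static long simulate_ikj(long n, long sets, long *compulsory_out) {
    long nn = n * n;
    long blocks = (3 * nn + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK;
    sim_cache c = { NULL, NULL, sets, 0, 0 };
    long result = -1;
    assert(n > 0 && sets > 0);
    c.tag = malloc((size_t)sets * sizeof *c.tag);
    c.seen = calloc((size_t)blocks, 1);
    if (c.tag && c.seen) {
        for (long s = 0; s < sets; s++) c.tag[s] = -1;
        for (long i = 0; i < n; i++)
            for (long k = 0; k < n; k++)
                for (long j = 0; j < n; j++) {
                    sim_access(&c, i * n + k);           /* read  A[i][k] */
                    sim_access(&c, nn + k * n + j);      /* read  B[k][j] */
                    sim_access(&c, 2 * nn + i * n + j);  /* read  C[i][j] */
                    sim_access(&c, 2 * nn + i * n + j);  /* write C[i][j] */
                }
        if (compulsory_out) *compulsory_out = c.compulsory;
        result = c.replacement;
    }
    free(c.tag);
    free(c.seen);
    return result;
}
```

For n = 8 the three arrays occupy 48 blocks, so a cache with 64 sets holds every block in its own set and only the 48 compulsory misses remain, whereas a cache with a single set incurs replacement misses almost immediately.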

Previous work [5] at this point introduces two additional constraints to make the problem tractable. First,
it assumes that the layout functions are row- or column-major, both of which are affine in the array coordinates.
We will subsequently use the term canonical layout to refer to these two layout functions. Second, it
assumes that $\mathrm{Last}(w)$ can be obtained through reuse vectors, which occurs when the array index expressions
are uniformly generated in addition to being affine in the LCVs. These two conditions keep everything
within the polyhedral model [3], which has been well-studied and for which counting algorithms are well-known
[9]. It is at this point that our work diverges from previous work.

Prior  empirical  evidence  [4,  1,  2]  suggests  that  alternative  array  layout  functions  such  as  Morton  order  [2] 
provide  better  cache  behavior  than  canonical  layout  functions  for  many  dense  linear  algebra  codes.  Such 
layout  functions  are  described  in  terms  of  interleavings  of  the  bits  in  the  binary  expansions  of  the  array 
co-ordinates  rather  than  as  affine  functions  of  the  numerical  values  of  these  quantities.  This  single  change 
puts  our  version  of  the  problem  beyond  the  scope  of  the  solution  techniques  for  the  polyhedral  model.  We 
will  therefore  need  to  investigate  different  techniques  for  counting  the  number  of  solutions  to  equations  such 
as  equation  (1). 

1.3  Array  layouts  based  on  bit  interleaving 

In developing this model of alternative array layouts, we assume that $n = 2^m$, so that the bit representation
of an array index will have $m$ bits, with the least significant bit (LSB) numbered 0 and the most significant
bit (MSB) numbered $m - 1$. We identify the binary sequence $s_{m-1} \ldots s_0$ with the non-negative integer
$s = \sum_{i=0}^{m-1} s_i 2^i$. We denote by $\mathcal{B}_m$ the set of all binary sequences of length $m$, and extend the above
identification to identify $\mathcal{B}_m$ with the interval $[0, 2^m - 1]$.

We will describe a family of nonlinear layout functions parameterized by a single parameter $\sigma$, as follows.
An $(m, m)$-interleaving, $\sigma$, is a $2m$-bit binary sequence containing $m$ 0s and $m$ 1s. It describes the
order in which bits from the two array coordinates are interleaved to linearize the array in memory. Given
$\sigma$, define its characteristic sequence $\chi_\sigma$ to be the sequence with entries $f_i$ and $s_i$ defined by replacing the
$(i + 1)$st 0 from the right in $\sigma$ by $f_i$ and the $(i + 1)$st 1 from the right in $\sigma$ by $s_i$. (The letters $f$ and $s$ are
chosen for mnemonic reasons: they are the initial letters of the words “first” and “second”.)

Example 2 Let $m = 4$ and let $\sigma = 10110010$. Then $\chi_\sigma = s_3 f_3 s_2 s_1 f_2 f_1 s_0 f_0$. Next, let $m = 3$ and let
$\sigma = 010011$. In this case, $\chi_\sigma = f_2 s_2 f_1 f_0 s_1 s_0$.

Given an $(m, m)$-interleaving $\sigma$, define a map

$$\Theta_\sigma : \mathcal{B}_m \times \mathcal{B}_m \to \mathcal{B}_{2m}$$

in the following way. If $a = a_{m-1} \ldots a_1 a_0 \in \mathcal{B}_m$ and $b = b_{m-1} \ldots b_1 b_0 \in \mathcal{B}_m$, then $\Theta_\sigma(a, b)$ is the
sequence obtained by replacing each $f_i$ in $\chi_\sigma$ by $a_i$ and each $s_i$ by $b_i$. We extend this notation to
consider $\Theta_\sigma$ as a map from $[0, 2^m - 1] \times [0, 2^m - 1]$ to $[0, 2^{2m} - 1]$ by identifying non-negative integers and
their binary expansions. We call $\Theta_\sigma$ the mixing function indexed by $\sigma$. Note that $\Theta_\sigma(0, 0) = 0$ for any $\sigma$.


Example 3 Let $m = 4$ and let $\sigma = 01101001$ so that $\chi_\sigma = f_3 s_3 s_2 f_2 s_1 f_1 f_0 s_0$. Then

$$\Theta_\sigma(12, 5) = \Theta_\sigma(1100, 0101) = 10110001 = 128 + 32 + 16 + 1 = 177.$$

Next, let $\sigma = 10110010$ so that $\chi_\sigma = s_3 f_3 s_2 s_1 f_2 f_1 s_0 f_0$. In this case,

$$\Theta_\sigma(9, 6) = \Theta_\sigma(1001, 0110) = 01110001 = 64 + 32 + 16 + 1 = 113.$$
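The mixing function of Example 3 can be sketched in C as below. The sketch assumes the interleaving is passed as a string of `'0'` and `'1'` characters written most significant bit first, as in the text; the function name `mix` and this string encoding are illustrative choices, not part of the paper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Interleaves the bits of a and b according to sigma.  Scanning sigma from
   the right, the (i+1)st '0' receives bit a_i (the "first" coordinate) and
   the (i+1)st '1' receives bit b_i (the "second" coordinate). */
static uint64_t mix(const char *sigma, uint64_t a, uint64_t b) {
    size_t len = strlen(sigma);
    uint64_t result = 0;
    unsigned fi = 0, si = 0;   /* next unused bit of a and of b */
    assert(len <= 64);
    for (size_t t = 0; t < len; t++) {        /* t is the bit position, LSB = 0 */
        char ch = sigma[len - 1 - t];
        uint64_t bit;
        assert(ch == '0' || ch == '1');
        if (ch == '0') bit = (a >> fi++) & 1u;
        else           bit = (b >> si++) & 1u;
        result |= bit << t;
    }
    return result;
}
```

Calling `mix("01101001", 12, 5)` reproduces the value 177 from the example, and `mix("10110010", 9, 6)` reproduces 113.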

Many popular layout functions fall into this class. For example, row-major layout corresponds to the
signature $\sigma = \underbrace{0 \ldots 0}_{m} \underbrace{1 \ldots 1}_{m}$; column-major layout corresponds to the signature
$\sigma = \underbrace{1 \ldots 1}_{m} \underbrace{0 \ldots 0}_{m}$; pure Morton layout corresponds to the signature
$\sigma = \underbrace{01 \ldots 01}_{2m}$; a combination of Morton layout with $2^k \times 2^k$
tiles arranged in row-major order corresponds to the signature
$\sigma = \underbrace{01 \ldots 01}_{2(m-k)} \underbrace{0 \ldots 0}_{k} \underbrace{1 \ldots 1}_{k}$; and so on.
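The signatures listed above can be generated mechanically. The following sketch builds them as strings written most significant bit first, taking $m$ as the number of bits per index; the builder names and the string encoding are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each builder writes a 2m-character signature plus a terminating NUL,
   so buf must hold at least 2*m + 1 characters. */

/* Appends `count` copies of ch at p and returns the new end pointer. */
static char *put_run(char *p, char ch, size_t count) {
    memset(p, ch, count);
    return p + count;
}

/* Appends `pairs` copies of "01" at p and returns the new end pointer. */
static char *put_pairs(char *p, size_t pairs) {
    while (pairs-- > 0) { *p++ = '0'; *p++ = '1'; }
    return p;
}

/* m zeros followed by m ones: row index bits above column index bits. */
static void row_major_sigma(size_t m, char *buf) {
    *put_run(put_run(buf, '0', m), '1', m) = '\0';
}

/* m ones followed by m zeros. */
static void column_major_sigma(size_t m, char *buf) {
    *put_run(put_run(buf, '1', m), '0', m) = '\0';
}

/* "01" repeated m times. */
static void morton_sigma(size_t m, char *buf) {
    *put_pairs(buf, m) = '\0';
}

/* "01" repeated m-k times, then k zeros, then k ones. */
static void tiled_morton_sigma(size_t m, size_t k, char *buf) {
    assert(k <= m);
    *put_run(put_run(put_pairs(buf, m - k), '0', k), '1', k) = '\0';
}
```

For instance, with $m = 4$ and $k = 2$ the tiled builder yields `01010011`: the low four bits follow the row-major pattern inside a $4 \times 4$ tile, and the high four bits alternate.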

We are now ready to discuss the matter of the layout functions of the three arrays in our matrix multiplication
example. Given an arbitrary array element indexed $(r, c)$, the quantity $\Theta_\sigma(r, c)$ gives the position of
the element $(r, c)$ relative to the starting position of the array in memory. We use the generic notation $\mu$ to
denote this starting address. Specifically, we assume the following forms of layout functions for $A$, $B$, and
$C$:

$$\mathcal{L}_A(r, c) = \mu_1 + \Theta_\sigma(r, c)$$

$$\mathcal{L}_B(r, c) = \mu_2 + \Theta_\sigma(r, c)$$

$$\mathcal{L}_C(r, c) = \mu_3 + \Theta_\sigma(r, c).$$

1.4  Goals  and  structure  of  the  paper 

Our  overall  goal,  to  be  studied  in  a  subsequent  paper,  is  to  find  the  layout  functions  of  the  form  shown  above 
that  minimize  cache  misses.  In  this  paper,  we  create  an  analytic  model  of  cache  misses  using  layout  functions 
of  this  form,  and  we  use  this  model  to  estimate  the  number  of  cache  misses  in  the  matrix  multiplication 
example.  These  results  will  form  the  basis  for  the  analysis  in  future  work. 

The  counting  of  cache  misses  for  the  matrix  multiplication  example  is,  in  the  end,  a  giant  case  analysis  of 
all  possible  patterns  of  interference  among  the  various  arrays.  Fortunately,  this  analysis  ultimately  reduces 
to  solving  two  enumeration  problems,  which  are  then  adapted  and  augmented  in  diverse  ways,  and  finally 
combined  using  inclusion-exclusion.  We  first  discuss  the  two  enumeration  problems  and  their  solutions  in 
an  abstract  setting  in  Section  2.  We  then  adapt  these  algorithms  to  the  cache  model  in  Section  3  and  to  the 
problem  of  counting  cache  misses  in  Section  4.  We  extend  our  analysis  to  set-associative  caches  in  Section 
5,  and  conclude  in  Section  6. 

2  Two  Enumeration  Problems 

In this section we study a pair of counting problems which together form the foundation for our enumeration
of cache misses. We will not attempt to determine closed-form expressions for these numbers; almost
certainly the answers to these questions cannot be put in elegant closed forms. Instead, our goal will be to
describe efficient algorithms to determine the number of solutions.

We will let $n = 2^m$ and $\rho = 2^p$ be as in the last section. For any positive integer $q$ we will let $\mathcal{B}_q$
denote the set of binary sequences $e = e_{q-1} \ldots e_1 e_0$ of length $q$. When convenient, we will treat $e$ as a
non-negative integer in the range 0 to $2^q - 1$ using the usual notion of binary representation.

2.1  Algorithm  AB(d) 

Given an $(m, m)$-interleaving $\sigma$, an integer $d$ with a $p$-bit binary expansion, and an initial carry $k_0 \in \{0, 1\}$,
we want to determine $AB(d)$, the number of triples $(a, b, c) \in \mathcal{B}_m^3$ such that

$$\Theta_\sigma(a, b) = \Theta_\sigma(b, c) + d + k_0 \bmod 2^p \tag{2}$$

under the condition that $2m \le p$. A correct but inefficient algorithm would enumerate all $2^{3m}$ possible triples
$(a, b, c)$ and check satisfiability of equation (2) for each triple. Such an algorithm would take time
exponential in $m$. The basic technique that we will use to derive an efficient algorithm of time complexity
$O(\max(m, p))$ is to reason about individual bits of the terms on either side of the equation in terms of
whether they propagate or generate carry bits. We will denote by $k_t$ the carry input at bit position $t$ (or,
equivalently, the carry output at position $t - 1$). Note that $k_0$, the carry input at the least significant bit, is
supplied.
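For reference, the correct but inefficient enumeration described above might look like the sketch below. It assumes the interleaving is given as a string of `'0'` and `'1'` characters written most significant bit first; `mix` and `count_ab_brute` are hypothetical names, and the loop is only practical for small $m$.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mixing function: the (i+1)st '0' from the right in sigma takes bit a_i,
   the (i+1)st '1' from the right takes bit b_i. */
static uint64_t mix(const char *sigma, uint64_t a, uint64_t b) {
    size_t len = strlen(sigma);
    uint64_t result = 0;
    unsigned fi = 0, si = 0;
    for (size_t t = 0; t < len; t++) {
        char ch = sigma[len - 1 - t];
        uint64_t bit = (ch == '0') ? (a >> fi++) & 1u : (b >> si++) & 1u;
        result |= bit << t;
    }
    return result;
}

/* Counts the triples (a, b, c) in [0, 2^m)^3 that satisfy
   mix(a, b) == (mix(b, c) + d + k0) mod 2^p, by testing every triple. */
static uint64_t count_ab_brute(const char *sigma, unsigned p,
                               uint64_t d, unsigned k0) {
    size_t len = strlen(sigma);
    unsigned m = (unsigned)(len / 2);
    uint64_t n, mask, count = 0;
    /* Requires 2m <= p; m is capped so that the 2^(3m) loop stays feasible. */
    assert(len % 2 == 0 && m <= 10 && len <= p && p < 64 && k0 <= 1);
    n = 1ull << m;
    mask = (1ull << p) - 1;
    for (uint64_t a = 0; a < n; a++)
        for (uint64_t b = 0; b < n; b++)
            for (uint64_t c = 0; c < n; c++)
                if (mix(sigma, a, b) == ((mix(sigma, b, c) + d + k0) & mask))
                    count++;
    return count;
}
```

With $\sigma = 001110$, $k_0 = 1$ and $p = 7$, the call with $d = 0011000_2 = 24$ counts 6 triples and the call with $d = 1011000_2 = 88$ counts 2, so the single high-order bit of $d$ selects between two different counts.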

The first observation is that we can simplify the problem based on the values of bits $d_{p-1}$ through $d_{2m}$.

Definition 2.1 (Consistency of $d$)

Let $\sigma$ be an $(m, m)$-interleaving and let $d = d_{p-1} \ldots d_0 \in \mathcal{B}_p$. Let $\tau = [u, \ldots, v]$ be a subsequence of
$P = [0, \ldots, p - 1]$. We say that $d$ is $\epsilon$-consistent on $\tau$ if $d_j = \epsilon$ for all $j \in \tau$. We say that $d$ is inconsistent
on $\tau$ if it is neither 0-consistent nor 1-consistent on $\tau$.

Lemma 2.1 Equation (2) has no solutions if $d$ is inconsistent on $[2m, \ldots, p - 1]$. For $\epsilon \in \{0, 1\}$, if $d$ is
$\epsilon$-consistent on $[2m, \ldots, p - 1]$, then equation (2) has solutions iff $k_{2m} = k_p = \epsilon$.

Proof: By case analysis on bits $d_{p-1}$ through $d_{2m}$. □

This reduces the original problem to that of counting the number of solutions to a reduced system $E$ of
$2m$ bit-equations, and separating the solutions of $E$ based on the value of $k_{2m}$ that they produce. Let $n_\epsilon$ be
the number of solutions of $E$ that produce $k_{2m} = \epsilon$ for $\epsilon \in \{0, 1\}$. Then we have the following expression
for $AB(d)$:

$$AB(d) = \begin{cases} n_0, & \text{if } d_{p-1} = \cdots = d_{2m} = 0 \\ n_1, & \text{if } d_{p-1} = \cdots = d_{2m} = 1 \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

We will now give an algorithm to determine the pair $(n_0, n_1)$.

Let us label the $2m$ components of $E$ with the numbers 0 through $2m - 1$, with $t$ being the label of the
equation corresponding to bit position $t$. Bit equation $t$ has one of two forms:

$$b_i = c_i + d_t \tag{4}$$

$$a_i = b_i + d_t \tag{5}$$

where $0 \le i < m$. For any fixed $i$, there is exactly one equation of form (4) and one equation of form (5)
(of course, with different values of $t$). Of these, call the equation with larger value of $t$ the major $i$-equation,
and the equation with smaller value of $t$ the minor $i$-equation.

The + in the above equations is to be interpreted as binary addition, with hidden carry bits. To make this
explicit, we rewrite the component equations in a more elaborate form, using the operations exclusive-or
(denoted $\oplus$) and majority (denoted MAJ). For equation (4) we get

$$b_i = c_i \oplus d_t \oplus k_t \tag{6}$$

$$k_{t+1} = \mathrm{MAJ}(c_i, d_t, k_t) \tag{7}$$

while for equation (5) we get

$$a_i = b_i \oplus d_t \oplus k_t \tag{8}$$

$$k_{t+1} = \mathrm{MAJ}(b_i, d_t, k_t) \tag{9}$$
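The exclusive-or and majority forms above are the usual full-adder equations. As a small illustration, the following C sketch adds two numbers and an initial carry bit by bit in exactly this way; the names `maj` and `ripple_add` are assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Majority of three bits: 1 when at least two of the inputs are 1. */
static unsigned maj(unsigned x, unsigned y, unsigned z) {
    return (x & y) | (y & z) | (x & z);
}

/* Computes (x + d + k0) mod 2^bits one bit position at a time.  The sum bit
   at position t is x_t XOR d_t XOR k_t and the next carry is
   MAJ(x_t, d_t, k_t).  The terminal carry is stored in *carry_out. */
static uint64_t ripple_add(uint64_t x, uint64_t d, unsigned k0,
                           unsigned bits, unsigned *carry_out) {
    uint64_t sum = 0;
    unsigned k = k0 & 1u;
    assert(bits <= 63);
    for (unsigned t = 0; t < bits; t++) {
        unsigned xt = (unsigned)((x >> t) & 1u);
        unsigned dt = (unsigned)((d >> t) & 1u);
        sum |= (uint64_t)(xt ^ dt ^ k) << t;
        k = maj(xt, dt, k);
    }
    if (carry_out) *carry_out = k;
    return sum;
}
```

Adding 39, 24 and an initial carry of 1 over six bits gives 64, so the six-bit sum is 0 and the terminal carry is 1.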


Our interest is not so much in specific values of the bits $a_i$, $b_i$, and $c_i$, but rather in the terminal carry
$k_{2m}$ that any particular assignment of bits produces. As $b$ is the only variable that occurs on both sides of
component equations, a particular choice of $b$ uniquely determines values of $a$ and $c$. We will therefore use
the bits of $b$ to collect solution triples that generate a common terminal carry. Looking at the behavior of
component equation $t$ for a specific choice of $b_i$, we observe that it has three possible modes.

1. $k_{t+1} = k_t$. We call this mode Propagate, or $P$ for short.

2. $k_{t+1} = 0$, independent of $k_t$. We call this mode 0-Generate, or $G_0$ for short.

3. $k_{t+1} = 1$, independent of $k_t$. We call this mode 1-Generate, or $G_1$ for short.

The following lemma relates these modes to the choice of value $b_i$.


Lemma 2.2 Equation (4) behaves in mode $P$ if we set $b_i = d_t$ and in mode $G_{d_t}$ if $b_i = \bar{d}_t$. Equation (5)
behaves in mode $G_{d_t}$ if we set $b_i = d_t$ and in mode $P$ if $b_i = \bar{d}_t$.

Proof:  Simple  case  analysis  based  on  possible  values  of  bi  and  of  kt.  □ 
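The case analysis behind Lemma 2.2 is small enough to carry out mechanically. The sketch below classifies the behavior of each equation form by evaluating its carry output for both values of the carry input; the enum and function names are illustrative assumptions.

```c
#include <assert.h>

typedef enum { MODE_P, MODE_G0, MODE_G1 } carry_mode;

/* Majority of three bits. */
static unsigned maj(unsigned x, unsigned y, unsigned z) {
    return (x & y) | (y & z) | (x & z);
}

/* Names the mode from the carry outputs seen for carry inputs 0 and 1. */
static carry_mode classify(unsigned out_k0, unsigned out_k1) {
    if (out_k0 == 0 && out_k1 == 1) return MODE_P;
    assert(out_k0 == out_k1);   /* a carry input is never inverted */
    return out_k0 ? MODE_G1 : MODE_G0;
}

/* Form (4): b_i = c_i + d_t.  For a fixed b_i, the bit c_i is forced to
   b_i XOR d_t XOR k_t, and the carry output is MAJ(c_i, d_t, k_t). */
static carry_mode mode_form4(unsigned bi, unsigned dt) {
    unsigned out[2];
    for (unsigned k = 0; k < 2; k++) {
        unsigned ci = bi ^ dt ^ k;
        out[k] = maj(ci, dt, k);
    }
    return classify(out[0], out[1]);
}

/* Form (5): a_i = b_i + d_t.  The carry output is MAJ(b_i, d_t, k_t). */
static carry_mode mode_form5(unsigned bi, unsigned dt) {
    return classify(maj(bi, dt, 0), maj(bi, dt, 1));
}
```

Evaluating both functions on all four combinations of $b_i$ and $d_t$ reproduces the statement of the lemma: form (4) propagates exactly when $b_i = d_t$, and form (5) propagates exactly when $b_i \ne d_t$.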

The key idea in the algorithm is to capitalize on the $G_0$ and $G_1$ modes. Consider $b_{m-1}$, the most
significant bit of $b$. The major $(m - 1)$-equation occurs at position $2m - 1$, and the minor $(m - 1)$-equation
occurs at some position $s$ with $s < 2m - 1$. Depending on the form of equation $2m - 1$ and the value of
$d_{2m-1}$, one of the two choices for $b_{m-1}$ will lead to a $G_\epsilon$-mode (with $\epsilon \in \{0, 1\}$). This means that no matter
what values we assign to bits $b_{m-2}$ through $b_0$, they will all contribute to $n_\epsilon$. We can therefore increment
$n_\epsilon$ by $2^{m-1}$. The other choice of $b_{m-1}$ will lead to $P$-mode for the major equation (and some mode for the
minor equation as determined by Lemma 2.2). In this case, we need to explore further the assignment of
values to lower-order bits of $b$ to separate those assignments that contribute to $n_0$ from those that contribute
to $n_1$. To do this, we will symbolically reduce the major and minor $(m - 1)$-equations to their modes for
this choice of $b_{m-1}$, and proceed to the equations involving $b_{m-2}$.

The reason behind the reduction of component equations to behavior modes becomes clear if we consider
how the assignment of values to $b_i$, with $0 \le i < m - 1$, affects the
counts $n_0$ and $n_1$. The fact that we are reasoning about $b_i$ means:

• that we have already considered the bits $b_{m-1}$ through $b_{i+1}$;

• that we have identified the unique assignment of values of these bits that leads to $P$-modes for the
major $(m - 1)$-equation through the major $(i + 1)$-equation;

• that we have reduced all of these major and minor equations to their appropriate behavior modes for
these assignments of values to bits $b_{m-1}$ through $b_{i+1}$.


If component equation $t$ is the major $i$-equation, then this means that component equations $2m - 1$ through
$t + 1$ have been reduced. (Some of the component equations $t - 1$ through 0 may also have been reduced;
this does not concern us yet, because carries move from lower-order to higher-order bits.) In any case, one
of the two choices of $b_i$ will lead to a $G_\epsilon$ mode for component equation $t$. However, we cannot at this
point simply increment $n_\epsilon$ by $2^i$, the number of assignments to the lower-order bits $b_{i-1}$ through $b_0$, since
the generated carry $k_{t+1}$ may be altered as it travels through the
reduced component equations $t + 1$ through $2m - 1$. What we need to do is to determine the value $k_{2m} = \delta$
that emerges at the other end of this process, and increment $n_\delta$ by $2^i$. The representation of component
equations as modes facilitates the determination of $k_{2m}$.

One final observation about the algebraic structure of modes allows us to calculate the terminal carry
$k_{2m}$ in a constant number of operations. It is easily seen that the mode set $\{P, G_0, G_1\}$ is a monoid under
composition, with $P$ as the identity element. Composition is defined by the following table.

          |  $P$     $G_0$    $G_1$
  --------+------------------------
  $P$     |  $P$     $G_0$    $G_1$
  $G_0$   |  $G_0$   $G_0$    $G_0$
  $G_1$   |  $G_1$   $G_1$    $G_1$

In trying to interpret this “composition table”, remember that carries move from right to left; the row label
is the left factor and the column label is the right factor. Thus, $G_1 P$
means that an input carry first passes through a $P$-mode and then through a $G_1$ mode. This is equivalent to
a $G_1$ mode. Thus, instead of maintaining individual modes for reduced component equations $t + 1$ through
$2m - 1$ and laboriously propagating $k_{t+1}$ through them to obtain $k_{2m}$, we can keep a compact description
of the combined effect of these modes and obtain $k_{2m}$ from $k_{t+1}$ in a single step. Furthermore, we can
incrementally update this description as we move to lower-numbered component equations.
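The composition table collapses to a one-line rule: the left (later-applied) factor wins unless it is $P$. A minimal C sketch of composition and of applying a mode to a carry follows; the names `compose` and `apply` mirror the COMPOSE and APPLY operations used by the algorithm, but the encoding is an assumption of this sketch.

```c
#include <assert.h>

typedef enum { MODE_P, MODE_G0, MODE_G1 } carry_mode;

/* compose(outer, inner): the carry passes through `inner` first and then
   through `outer`.  A generate mode discards whatever arrives from below;
   a propagate mode passes the inner result through unchanged. */
static carry_mode compose(carry_mode outer, carry_mode inner) {
    return outer == MODE_P ? inner : outer;
}

/* Carry that emerges when carry k enters a block with the given mode. */
static unsigned apply(carry_mode mode, unsigned k) {
    assert(k <= 1);
    switch (mode) {
    case MODE_G0: return 0;
    case MODE_G1: return 1;
    default:      return k;
    }
}
```

Because the combined mode of any run of equations is again one of the three modes, the effect of arbitrarily many reduced equations on a carry can be stored in a single variable and updated in constant time.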

We are now ready to present the complete algorithm to determine $(n_0, n_1)$.

    n_0 ← 0
    n_1 ← 0
    mode ← P
    i ← m − 1
    for t = 2m − 1 downto 0 do
        if component equation t has been reduced to mode M then
            mode ← COMPOSE(mode, M)    /* Use composition table */
        else    /* This is the major i-equation */
            v ← value of b_i that makes this equation behave in mode G_ε, from Lemma 2.2
            δ ← APPLY(mode, ε)
            n_δ ← n_δ + 2^i
            Locate the minor i-equation and reduce it to the mode resulting from setting b_i = NOT v
            i ← i − 1
        endif
    enddo
    δ ← APPLY(mode, k_0)
    n_δ ← n_δ + 1

Theorem 2.1 The above program correctly computes $n_0$ and $n_1$ and runs in $O(m)$ steps.

Proof: Immediate from Lemmas 2.1 and 2.2. □

Example 4 Let $\sigma = 001110$, let $d_{2m-1} \cdots d_0 = 011000$ and let $k_0 = 1$. In this case, the equations are:


    E_0 :  a_0 = b_0 + 0
    E_1 :  b_0 = c_0 + 0
    E_2 :  b_1 = c_1 + 0
    E_3 :  b_2 = c_2 + 1
    E_4 :  a_1 = b_1 + 1
    E_5 :  a_2 = b_2 + 0


The system of equations and $(n_0, n_1)$ evolve in the following way as we go through the steps of the algorithm:

$t = 5$: Now $i = 2$. Set $v = 0$ because $b_2 = 0$ makes $E_5$ behave in mode $G_0$. Then $\delta = 0$ because mode $= P$ and $e = 0$ (i.e., a carry of 0 propagates through the reduced component equations). Update $n_0 = 0 + 4$. $E_3$ is the minor 2-equation and gets reduced to $P$-mode. This leaves:


    E_0 :  a_0 = b_0 + 0
    E_1 :  b_0 = c_0 + 0
    E_2 :  b_1 = c_1 + 0
    E_3 :  b_2 = c_2 + 1      P-mode
    E_4 :  a_1 = b_1 + 1
    E_5 :  a_2 = b_2 + 0      P-mode

$t = 4$: Now $i = 1$. Set $v = 1$ and $\delta = 1$. Update $n_1 = 0 + 2$. $E_2$ is the minor 1-equation and gets reduced to $P$-mode. This leaves:

    E_0 :  a_0 = b_0 + 0
    E_1 :  b_0 = c_0 + 0
    E_2 :  b_1 = c_1 + 0      P-mode
    E_3 :  b_2 = c_2 + 1      P-mode
    E_4 :  a_1 = b_1 + 1      P-mode
    E_5 :  a_2 = b_2 + 0      P-mode


$t = 3$: mode $= P$ because the previous value of mode, $P$, composed with the mode of $E_3$, $P$, is $P$.

$t = 2$: mode $= P$.

$t = 1$: Now $i = 0$. Set $v = 1$ and $\delta = 0$. Update $n_0 = 4 + 1$. $E_0$ is the minor 0-equation and gets reduced to $G_0$-mode. This leaves:


    E_0 :  a_0 = b_0 + 0      G_0-mode
    E_1 :  b_0 = c_0 + 0      P-mode
    E_2 :  b_1 = c_1 + 0      P-mode
    E_3 :  b_2 = c_2 + 1      P-mode
    E_4 :  a_1 = b_1 + 1      P-mode
    E_5 :  a_2 = b_2 + 0      P-mode

$t = 0$: mode $= G_0$. Set $\delta = 0$ and update $n_0 = 5 + 1$.

The final values for $(n_0, n_1)$ are $(6, 2)$, which agrees with the answer obtained by explicit generation of all solutions. So, if $d_{p-1} = \cdots = d_{2m} = 0$ then $AB(d) = 6$, whereas if $d_{p-1} = \cdots = d_{2m} = 1$ then $AB(d) = 2$.
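The "explicit generation of all solutions" for Example 4 can be reproduced by brute force. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the function names are ours, $\sigma$ is passed as a string with the highest position first, and $\Theta(x, y)$ is taken to place the bits of $x$ in the 0-positions of $\sigma$ and the bits of $y$ in the 1-positions, as the discussion of equations (2) and (10) indicates.

```python
def interleave(sigma, x, y):
    """Theta(x, y): bits of x fill the 0-positions of sigma, bits of y the 1-positions.

    sigma is written highest position first, e.g. "001110".
    """
    out = xi = yi = 0
    for t, s in enumerate(reversed(sigma)):  # t = bit position, starting at 0
        if s == "0":
            out |= ((x >> xi) & 1) << t
            xi += 1
        else:
            out |= ((y >> yi) & 1) << t
            yi += 1
    return out


def count_by_terminal_carry(sigma, d_low, k0):
    """Return (n_0, n_1): solutions of the low 2m bit-equations, split by carry k_2m."""
    m = sigma.count("0")
    width = len(sigma)  # 2m
    mask = (1 << width) - 1
    counts = [0, 0]
    for a in range(1 << m):
        for b in range(1 << m):
            left = interleave(sigma, a, b)
            for c in range(1 << m):
                total = interleave(sigma, b, c) + d_low + k0
                if total & mask == left:
                    counts[total >> width] += 1  # carry out of position 2m-1
    return tuple(counts)


print(count_by_terminal_carry("001110", 0b011000, 1))  # (6, 2), as in Example 4
```

The enumeration costs $2^{3m}$ steps, so it is useful only as a check on the $O(m)$ algorithm for small $m$.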

2.2  Algorithm  AC(d) 

We now investigate the following problem: Given an $(m, m)$-interleaving $\sigma$ and an integer $d$ with a $p$-bit binary expansion, determine $AC(d)$, the number of triples $(a, b, c)$ with $a, b, c \in B_m$ such that

    $\Theta(a, b) = \Theta(a, c) + d \bmod 2^p$    (10)

under the condition that $2m \le p$. This problem is superficially similar to equation (2), with one small but critical difference: the variable that occurs on both sides of equation (10) occurs in the 0-positions of $\sigma$ on both sides of the equation, whereas the variable that occurs on both sides of equation (2) occurs in the 0-positions of $\sigma$ on one side of the equation and the 1-positions of $\sigma$ on the other side of the equation. This difference makes the combinatorics of equation (10) radically different from the combinatorics of equation (2), leading in the end to a conceptually simpler algorithm to compute $AC(d)$.

If we write out equation (10) in terms of component bit-equations as we did for equation (2), we see that component equation $t$ (for $0 \le t < 2m$) has one of two forms: $a_i = a_i + d_t$ if $\sigma_t = 0$, and $b_i = c_i + d_t$ if $\sigma_t = 1$. The decoupling of the bits of $a$ from the bits of $b$ and $c$ indicates that the $a$-component of any solution of equation (10) can be chosen independently of the $b$- and $c$-components. The decoupling also suggests that we need to look at the distribution of 0s and 1s in $\sigma$. Based on these observations, we start with a few definitions.

Definition 2.2 (Runs of $\sigma$)

Let $\sigma$ be an $(m, m)$-interleaving, and let $P$ be the sequence $[0, \ldots, p-1]$. For $e \in \{0, 1\}$, an $e$-run of $\sigma$ is a maximal-length subsequence $[u, \ldots, v]$ of $P$ such that $\sigma_u = \cdots = \sigma_v = e$, where $\sigma_{2m}$ through $\sigma_{p-1}$ are declared to be 0. Order $e$-runs in increasing order of $u$, and denote the $i$th $e$-run of $\sigma$ by $R_i^{(e)}$.

For technical reasons that will soon become evident, we will always want the “lowest” run to be a 0-run. This is a problem only when $\sigma_0 = 1$. In this case, we will create a special empty 0-run $R_1^{(0)}$ and label the non-empty 0-runs from $R_2^{(0)}$ onwards. Thus, $R_i^{(1)}$ is sandwiched between $R_i^{(0)}$ and $R_{i+1}^{(0)}$. Note also that the 0-runs constrain possible choices of $a$, while the 1-runs constrain possible choices of $b$ and $c$.

We obtain strong conditions on the (non-)existence of solutions of equation (10) by considering the restrictions of $d$ to the 0-runs of $\sigma$. The intuition behind the following lemma and its proof are small variations of Lemma 2.1.

Lemma 2.3 Equation (10) has no solutions if $d$ is inconsistent on any 0-run of $\sigma$. For $e \in \{0, 1\}$, if $d$ is $e$-consistent on $R_i^{(0)} = [u, \ldots, v]$, then equation (10) has solutions iff $k_u = k_{v+1} = e$. If $R_1^{(0)}$ is empty, then every $d$ is declared to be 0-consistent on it. □

Lemma 2.3 has two important consequences. First, it provides an early termination test for the algorithm. Second, if $d$ is indeed consistent on all 0-runs of $\sigma$, then it simplifies the counting of the number of choices of $a$ in the following way. Note that each of the component equations is of the form $a_i = a_i + d_t$. Since the same element of $a$ appears on both sides of the equation, there is in fact no constraint on $a$! Thus, for every possible choice of $b$ and $c$ that we discover by examining the 1-runs (which we will do shortly), any of the $2^m$ choices of $a$ will work.

Consider $R_i^{(1)} = [u, \ldots, v]$, the $i$th 1-run of $\sigma$. Recall that this run is sandwiched between runs $R_i^{(0)}$ and $R_{i+1}^{(0)}$. Let $d$ be $z_j$-consistent on $R_j^{(0)}$. Let $t = |R_1^{(1)}| + \cdots + |R_{i-1}^{(1)}|$. Then the component equations in $R_i^{(1)}$ are as follows.

    $b_t = c_t \oplus d_u \oplus k_u$
    $k_{u+1} = \mathrm{MAJ}(c_t, d_u, k_u)$
    $b_{t+1} = c_{t+1} \oplus d_{u+1} \oplus k_{u+1}$
    $k_{u+2} = \mathrm{MAJ}(c_{t+1}, d_{u+1}, k_{u+1})$
    $\vdots$
    $b_{t+v-u} = c_{t+v-u} \oplus d_v \oplus k_v$
    $k_{v+1} = \mathrm{MAJ}(c_{t+v-u}, d_v, k_v)$

By Lemma 2.3, we know that $k_u = z_i$ and $k_{v+1} = z_{i+1}$. Thus we are constrained by being given the values of both the initial and terminal carries of the 1-run, and must determine how many choices of bit values for $b$ and $c$ honor these constraints. It turns out that the easiest way to count the possibilities is to reason about the bit patterns as non-negative integers. To this end, define $\delta_i = z_i + \sum_{j=u}^{v} d_j 2^{j-u}$. That is, $\delta_i$ is the integer corresponding to the bit pattern $d_v \cdots d_u$, with the initial carry value absorbed into it. Also, let $\Delta_i = 2^{v-u+1} - \delta_i$. We then get the following result by case analysis on the value of $k_{v+1}$.


Theorem 2.2 Let $\sigma$ be an $(m, m)$-interleaving and let $d \in B_p$ be consistent on all 0-runs of $\sigma$. Then the number of solutions to equation (10) is $2^m \cdot \prod_{i=1}^{l} F_i$, where $l$ is the number of 1-runs of $\sigma$ and

    $F_i = \Delta_i$  if $d$ is 0-consistent on $R_{i+1}^{(0)}$,
    $F_i = \delta_i$  if $d$ is 1-consistent on $R_{i+1}^{(0)}$.

Proof: By equating the coefficients of the distinct powers of 2 on the two sides of (10) we arrive at a set of restrictions on the sequences $a$, $b$, $c$. Lemma 2.3 describes restrictions that result from equating coefficients of powers $2^r$ where $r$ is in a 0-run of $\sigma$. The elements of $a$ appear in these equations, with the same element of $a$ appearing on both sides. This gives no restrictions on $a$ and so there are $2^m = n$ choices for $a$. This accounts for the factor of $2^m$ that appears in the formula. The remaining factors will count the number of choices we have for $b$ and $c$.

Consider restrictions on $b$ and $c$ that result from equating coefficients of powers $2^r$ for $r$ in a particular 1-run $R_i^{(1)}$. Define $\beta_i$ and $\gamma_i$ by $\beta_i = \sum_{j=t}^{t+v-u} b_j 2^{j-t}$ and $\gamma_i = \sum_{j=t}^{t+v-u} c_j 2^{j-t}$.

Case 1: Suppose $z_{i+1} = 0$. Then the component equations on $R_i^{(1)}$ are equivalent to

    $\beta_i - \gamma_i = \delta_i$    (11)

where we have equality of integers in equation (11). So, the number of choices we have for $b_j$, $c_j$ satisfying the component equations on $R_i^{(1)}$ is equal to the number of integers $\beta_i$, $\gamma_i$ with $0 \le \beta_i < 2^{v-u+1}$, $0 \le \gamma_i < 2^{v-u+1}$ that satisfy equation (11). For each $\beta_i$ with $\delta_i \le \beta_i < 2^{v-u+1}$ there is exactly one choice of $\gamma_i$ such that $\beta_i$, $\gamma_i$ satisfy equation (11). For $0 \le \beta_i < \delta_i$ there are no choices of $\gamma_i$ such that $\beta_i$, $\gamma_i$ satisfy equation (11). So the number of solutions to equation (11) is

    $2^{v-u+1} - \delta_i = \Delta_i$

which is the $i$th factor in the product in the statement of the theorem.

Case 2: Suppose $z_{i+1} = 1$. Then the component equations on $R_i^{(1)}$ are equivalent to:

    $\beta_i + 2^{v-u+1} = \delta_i + \gamma_i$

which can be rewritten as:

    $\gamma_i - \beta_i = 2^{v-u+1} - \delta_i$.    (12)

By the same reasoning as above, the number of solutions to equation (12) is $\delta_i$, which is the $i$th factor in the product in the statement of the theorem. □
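Theorem 2.2 and Lemma 2.3 translate into a single pass over the runs of $\sigma$. The following Python sketch is our own illustration, not the paper's Algorithm AC verbatim: the function names are hypothetical, `sigma` and `d_bits` are lists indexed by bit position (entry 0 is position 0), and the treatment of a 1-run that ends at position $p-1$, where the outgoing carry is discarded by the reduction mod $2^p$, is an assumption added so that the case $p = 2m$ is also covered. A brute-force counter is included for cross-checking on small inputs.

```python
def ac_by_runs(sigma, d_bits):
    """Count solutions of Theta(a,b) = Theta(a,c) + d mod 2^p via the runs of sigma."""
    p = len(d_bits)
    m = sigma.count(0)
    s = list(sigma) + [0] * (p - len(sigma))  # sigma_2m .. sigma_{p-1} declared 0
    carry = 0            # carry entering the current run (k_0 = 0)
    result = 1 << m      # a is unconstrained: 2^m choices
    t = 0
    while t < p:
        v = t
        while v + 1 < p and s[v + 1] == s[t]:
            v += 1
        block = d_bits[t:v + 1]
        if s[t] == 0:
            # 0-run: d must be consistent and equal to the incoming carry (Lemma 2.3).
            if any(bit != carry for bit in block):
                return 0
        else:
            length = v - t + 1
            delta = carry + sum(bit << j for j, bit in enumerate(block))
            if v + 1 < p:
                z_next = d_bits[v + 1]  # value d must take on the next 0-run
                result *= delta if z_next == 1 else (1 << length) - delta
                carry = z_next
            else:
                result *= 1 << length   # assumption: top run, carry out is discarded
        t = v + 1
    return result


def interleave_bits(sigma, x, y):
    """Theta(x, y) with sigma given as a list indexed by position."""
    out = xi = yi = 0
    for t, s in enumerate(sigma):
        if s == 0:
            out |= ((x >> xi) & 1) << t
            xi += 1
        else:
            out |= ((y >> yi) & 1) << t
            yi += 1
    return out


def ac_brute_force(sigma, d_bits):
    """Reference count by enumerating all triples (a, b, c)."""
    p, m = len(d_bits), sigma.count(0)
    d = sum(bit << t for t, bit in enumerate(d_bits))
    mask = (1 << p) - 1
    return sum(
        1
        for a in range(1 << m) for b in range(1 << m) for c in range(1 << m)
        if interleave_bits(sigma, a, b) == (interleave_bits(sigma, a, c) + d) & mask
    )
```

The run-based count touches each of the $p$ positions once, whereas the reference count enumerates $2^{3m}$ triples.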

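Example 5 Let $m = 5$, $p = 12$, $\sigma = 1000110110$ and $d = 111101001111$. We will use Algorithm AC to compute $AC(d)$. The runs listed below correspond to reading both strings with index 0 leftmost, so that $\sigma_0 = 1$ and $d_0 = 1$.

In Step 1 we compute the runs and the consistency values $z_i$:

    $R_1^{(0)} = [\,]$,            $z_1 = 0$
    $R_1^{(1)} = [0]$
    $R_2^{(0)} = [1, 2, 3]$,     $z_2 = 1$
    $R_2^{(1)} = [4, 5]$
    $R_3^{(0)} = [6]$,           $z_3 = 0$
    $R_3^{(1)} = [7, 8]$
    $R_4^{(0)} = [9, 10, 11]$,   $z_4 = 1$

In Step 2 we compute the factors $F_i$ and use them to determine $AC(d)$:

    i    δ_i    Δ_i    F_i

    1    1      1      1
    2    3      1      1
    3    2      2      2

So $AC(d) = 32 \cdot 1 \cdot 1 \cdot 2 = 64$.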

2.3  Counting  joint  solutions 

The last problem we will consider in this section is to count those triples $a, b, c \in B_m$ which satisfy the two equations:

    $\Theta(a, b) = \Theta(b, c) + d$    (13)

and

    $\Theta(a, b) = \Theta(a, c) + e$    (14)

simultaneously. It is instructive to consider an example.

Example 6 Let $m = 5$, $p = 11$, $\sigma = 0110001011$, $d = 00010101111$, and $e = 00110001101$. Recalling the characteristic sequence notation from Section 1.3, $\chi_\sigma = f_4 s_4 s_3 f_3 f_2 f_1 s_2 f_0 s_1 s_0$. Then the simultaneous equations that must be satisfied are:

    Θ(a, b) = Θ(b, c) + d        Θ(a, b) = Θ(a, c) + e

    b_0 = c_0 + 1                b_0 = c_0 + 1                s_0
    b_1 = c_1 + 1 + k_0          b_1 = c_1 + 0 + ℓ_0          s_1
    a_0 = b_0 + 1 + k_1          a_0 = a_0 + 1 + ℓ_1          f_0
    b_2 = c_2 + 1 + k_2          b_2 = c_2 + 1 + ℓ_2          s_2
    a_1 = b_1 + 0 + k_3          a_1 = a_1 + 0 + ℓ_3          f_1
    a_2 = b_2 + 1 + k_4          a_2 = a_2 + 0 + ℓ_4          f_2
    a_3 = b_3 + 0 + k_5          a_3 = a_3 + 0 + ℓ_5          f_3
    b_3 = c_3 + 1 + k_6          b_3 = c_3 + 1 + ℓ_6          s_3
    b_4 = c_4 + 0 + k_7          b_4 = c_4 + 1 + ℓ_7          s_4
    a_4 = b_4 + 0 + k_8          a_4 = a_4 + 0 + ℓ_8          f_4
    0 = 0 + k_9                  0 = 0 + ℓ_9

In the above set of equations, $k_t$ is the carry from the $t$th to the $(t+1)$st equation in $\Theta(a, b) = \Theta(b, c) + d$ whereas $\ell_t$ is the carry from the $t$th to the $(t+1)$st equation in $\Theta(a, b) = \Theta(a, c) + e$. We will refer to these two sets of equations as the $d$-system and the $e$-system. Also, we will let $B$ denote the number of $f_i$-equations in the system above. Note that $B = m$ if $2m < p$.

As we will see, it is seldom the case that there are any simultaneous solutions to equations (13) and (14). The next result states that even if there are simultaneous solutions, there are not very many.

Theorem 2.3 The number of simultaneous solutions to equations (13) and (14) is less than or equal to $2^{-B}$ times the number of solutions to equation (14).

Proof: Suppose there is a simultaneous solution to equations (13) and (14). Then the $s_i$-equations determine the values of $b_0, b_1, \ldots, b_{B-1}$. To this simultaneous solution of equations (13) and (14) we can correspond $2^B$ solutions to equation (14) which have the same $b_i$ and $c_i$ but where the choices of $a_0, a_1, \ldots, a_{B-1}$ range over all possibilities. □

One might ask whether there are instances in which the number of simultaneous solutions to equations (13) and (14) is exactly $2^{-B}$ times the number of solutions to equation (14). The next result tells us that this is the case when $d = e$.

Definition  2.3 

Let $S$ denote the set of solutions to equation (14). We say two solutions $(a^{(1)}, b^{(1)}, c^{(1)})$ and $(a^{(2)}, b^{(2)}, c^{(2)}) \in S$ are equivalent if $b^{(1)} = b^{(2)}$ and $c^{(1)} = c^{(2)}$.

It is straightforward to see that every equivalence class has size $n$ and that equivalence classes are indexed by pairs $b, c \in B_m$.

Theorem 2.4 If $d = e$, then there is exactly one solution to equation (13) in every equivalence class of solutions to equation (14).

Proof: Consider the equivalence class indexed by the pair $b, c$. It is clear that there is at most one solution $(a, b, c)$ to equation (13) in that equivalence class because $a_i$ is determined by equation $f_i$. It remains to show that there is at least one solution.

Consider the process of solving for the $a_i$ and the carries $k_t$ in equations of type (13) starting with $b, c$ which gives (along with any $a$) a solution to equation (14). The thing we need to check is that the carries $k_t$ we get in the equations of type (13) are identical to the carries $\ell_t$ we get in equations of type (14). We see this by induction on $t$.

Assume that $k_{t-1} = \ell_{t-1}$. There are two cases to consider. First assume equation $t$ is labelled $s_i$, so that in system (13) the $t$th equation is $b_i = c_i + d_t + k_{t-1}$ and in system (14) the $t$th equation is $b_i = c_i + d_t + \ell_{t-1}$. In this situation it is clear that $k_t$ will be equal to $\ell_t$. Next, assume that equation $t$ is labelled $f_i$. In this case the $t$th equation in system (13) is $a_i = b_i + d_t + k_{t-1}$ whereas the $t$th equation in system (14) is $a_i = a_i + d_t + \ell_{t-1}$. By Lemma 2.3 we have $\ell_t = \ell_{t-1} = d_t$. By our induction hypothesis, $k_{t-1} = \ell_{t-1} = d_t$. Because $k_{t-1} = d_t$ we have $k_t = d_t$, so $k_t = \ell_t$, which completes the induction step and finishes the proof. □

Corollary 2.5 Let notation be as in Theorem 2.3. Then:

1. The number of triples $(a, b, c)$ which are simultaneous solutions to $\Theta(a, b) = \Theta(b, c) + d$ and $\Theta(a, b) = \Theta(a, c) + d$ is $2^{-B} \cdot AC(d)$.

2. The number of simultaneous solutions can be computed in $O(p)$ steps.

The  above  results  show  that  there  are  not  very  many  simultaneous  solutions  of  equations  (13)  and  (14). 
The  next  results  indicate  that  in  most  instances  there  are  no  simultaneous  solutions. 

Suppose  there  exist  simultaneous  solutions  to  equations  (13)  and  (14).  From  our  previous  analysis,  we 
know  a  number  of  things. 

a) $e$ must be consistent on 0-runs of $\sigma$.

b) In equation (14) the carry into any 0-run and the carry out of that 0-run must both match the value of $e$ on that run.

As a first test of whether there exist simultaneous solutions to equations (13) and (14), conditions a) and b) can be checked in $O(p)$ steps. We are now going to focus on 1-runs.

Suppose that equations $u, u+1, \ldots, u+j-1$ constitute a 1-run and that these equations are labeled $s_i, s_{i+1}, \ldots, s_{i+j-1}$. Let $\beta, \gamma, \delta, \varepsilon$ be the numbers with binary expansions given below:

    $\beta = b_i b_{i+1} \cdots b_{i+j-1}$
    $\gamma = c_i c_{i+1} \cdots c_{i+j-1}$
    $\delta = d_u d_{u+1} \cdots d_{u+j-1}$
    $\varepsilon = e_u e_{u+1} \cdots e_{u+j-1}$.

By comparing equations $s_i, \ldots, s_{i+j-1}$ in equations (13) and (14) we see that:

    $\beta = \gamma + \delta + k_{u-1} = \gamma + \varepsilon + \ell_{u-1}$.    (15)

Also $\ell_{u-1}$ is specified to be the consistent value of $e$ on the preceding 0-run and $\gamma$ must be chosen so that $\gamma + \varepsilon + \ell_{u-1}$ is less than $2^j$ iff the value of $e$ on the subsequent 0-run is 0. From equation (15), the following result follows immediately.

Theorem 2.6 Let $\delta$ and $\varepsilon$ be the numbers whose binary expansions are given by the binary digits of $d$ and $e$ on a 1-run of $\sigma$ as above. If there are simultaneous solutions to equations (13) and (14) then $\delta$ and $\varepsilon$ must differ by no more than 1.

More precisely, we must have one of the following four cases:

1. $\delta = \varepsilon + \ell_{u-1}$.

   Note: In this case we also must have $k_{u-1} = 0$.

2. $\delta = \varepsilon + \ell_{u-1} - 1$.

   Note: In this case we also must have $k_{u-1} = 1$.

3. $\delta = 0$, $\varepsilon = 2^j - 1$, $\ell_{u-1} = 1$.

   Note: In this case, we also must have that $k_{u-1} = 0$ and that $e$ is consistently 1 on the next 0-run.

4. $\delta = 2^j - 1$, $\varepsilon = \ell_{u-1} = 0$.

   Note: In this case, we also must have that $k_{u-1} = 1$ and that $e$ is consistently 0 on the next 0-run.
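The four cases are exactly the ways in which equation (15) can hold for $j$-bit numbers, that is, the solutions of $\delta + k_{u-1} \equiv \varepsilon + \ell_{u-1} \pmod{2^j}$ with $k_{u-1} \in \{0, 1\}$. A minimal Python sketch of the resulting per-run test follows; the function name and argument order are ours, chosen for illustration only.

```python
def compatible_incoming_carries(delta, eps, l_prev, j):
    """Values of k_{u-1} for which equation (15) can hold on a 1-run of length j.

    delta, eps: the j-bit numbers read from d and e on the run.
    l_prev: the carry l_{u-1} entering the run in the e-system.
    An empty result means there are no simultaneous solutions.
    """
    modulus = 1 << j
    return [k for k in (0, 1) if (delta + k - eps - l_prev) % modulus == 0]
```

For instance, `compatible_incoming_carries(0, 3, 1, 2)` corresponds to Case 3 and `compatible_incoming_carries(3, 0, 0, 2)` to Case 4; whenever $\delta$ and $\varepsilon$ differ by 2 or more modulo $2^j$ the list is empty.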


Theorem 2.6 gives another $O(p)$ test which can determine that there are no simultaneous solutions to equations (13) and (14). Note that if we assume $d$ and $e$ are chosen randomly, then Theorem 2.6 together with condition a) show that the probability that there exist simultaneous solutions to equations (13) and (14) is no more than $2^{R-B-U}$, where $B$ is the number of $f_i$-equations, $U$ is the number of $s_i$-equations and $R$ is the number of runs of $\sigma$. Note that $B + U = \min\{2m, p\}$. Alternatively, if we have some freedom to choose $d$, $e$ and $\sigma$, then the conditions given in a) and Theorem 2.6 can be used to ensure that there are no simultaneous solutions to equations (13) and (14). We will return to this important point in our later paper on minimizing the number of cache misses.

It seems unlikely to us that there exists an algorithm which is polynomial in $m$ or linear in $p$ which determines the exact number of simultaneous solutions to equations (13) and (14). To conclude, we examine the case given in Example 6 to point out some of the complexities of this problem.

Turning to the set of equations given in Example 6, we first examine whether the conditions set out in a) and b) hold. It can be seen that $e$ is consistent on 0-runs with value 1 on $f_0$, value 0 on $f_1, f_2, f_3$ and value 0 on $f_4$. Condition b) thus implies that $\ell_1 = \ell_2 = 1$, $\ell_3 = \ell_4 = \ell_5 = \ell_6 = 0$ and $\ell_8 = \ell_9 = 0$.

To now consider the constraints given by Theorem 2.6, we must look at 1-runs. For the 1-run $s_0, s_1$, we have $\delta = \varepsilon = 3$ and $\ell_{-1} = 0$. So we are in Case 1. This implies that $k_{-1} = 0$ and that $\ell_1 = 1$. This gives a constraint on $\gamma = c_0 c_1$, i.e., $\gamma + \varepsilon \ge 4$. This constraint on $\gamma$, which comes from consideration of the $e$-equations, implies that $k_1 = 1$.

Moving now to the 1-run $s_2$, we have that $\delta = 1$, $\varepsilon = 0$ and $\ell_2 = 1$. So we are again in Case 1, which implies that $k_2 = 0$. But now we have an inconsistency: it is impossible to have $k_1 = 1$ and $k_2 = 0$. So there are no simultaneous solutions to equations (13) and (14) in the case given in Example 6.

This particular example gives a flavor of the complex interplay that can take place between the constraints imposed by the $d$-equations and those imposed by the $e$-equations. At this time, we do not know a fast algorithm to determine the number of simultaneous solutions exactly.

3  Incorporating  Cache  Block  Size 

In the last section, we devised fast algorithms to compute the number of solutions to systems of equations of the form

    $\Theta(a, b) = \Theta(b, c) + d \bmod 2^p$

and

    $\Theta(a, b) = \Theta(a, c) + d \bmod 2^p$.

In practice, we will need to extend these algorithms to enumerate solutions to a slightly different pair of equations. Usually $2^\Lambda$ memory locations fit into a cache block, represented by the denominator in the following equations. Thus, the equation has to be taken mod $2^{p-\Lambda}$. In practice, $\Lambda$ often equals 2 or 4. We show the case $\Lambda = 2$ to provide the case distinction in full detail; the extension for $\Lambda \in \mathbb{N}_0$ is straightforward, but requires consideration of more cases for $\Lambda > 2$.

    $\lfloor (\Theta(a, b) + \alpha)/4 \rfloor = \lfloor (\Theta(b, c) + \beta)/4 \rfloor \bmod 2^{p-2}$    (16)

and

    $\lfloor (\Theta(a, b) + \alpha)/4 \rfloor = \lfloor (\Theta(a, c) + \beta)/4 \rfloor \bmod 2^{p-2}$    (17)

where $\alpha, \beta \in B_p$.

In  this  section  we  sketch  methods,  based  on  the  ideas  and  algorithms  developed  in  Section  2,  to  compute 
the  number  of  solutions  to  equations  (16)  and  (17).  We  will  take  the  two  equations  in  turn,  starting  with 
equation  (17)  because  much  of  what  we  find  there  can  later  be  reused  for  the  treatment  of  equation  (16). 


3.1  Computing  the  number  of  solutions  to  equation  (17) 

To begin we will write out the digits in the binary expansions of $\Theta(a, b) + \alpha$ and $\Theta(a, c) + \beta$. Equating these expressions gives a system of equations $E = E_0, E_1, \ldots, E_{p-1}$ where equation (17) imposes the requirement that equations $E_2, E_3, \ldots, E_{p-1}$ must be satisfied mod 2. In order to satisfy equation (17), $E_0$ or $E_1$ need not hold mod 2. Consider $E_0$ and $E_1$. They look like one of the following:

    $a_0 + \alpha_0 = a_0 + \beta_0$
    $a_1 + \alpha_1 + k_0 = a_1 + \beta_1 + \ell_0$    (18)

    $a_0 + \alpha_0 = a_0 + \beta_0$
    $b_0 + \alpha_1 + k_0 = c_0 + \beta_1 + \ell_0$    (19)

    $b_0 + \alpha_0 = c_0 + \beta_0$
    $a_0 + \alpha_1 + k_0 = a_0 + \beta_1 + \ell_0$    (20)

    $b_0 + \alpha_0 = c_0 + \beta_0$
    $b_1 + \alpha_1 + k_0 = c_1 + \beta_1 + \ell_0$    (21)

where $k_0$ is the carry from the left side of $E_0$, and $\ell_0$ is the carry from the right side of $E_0$. Case (18) occurs when $\sigma_1\sigma_0 = 00$, (19) when $\sigma_1\sigma_0 = 10$, (20) when $\sigma_1\sigma_0 = 01$, and (21) when $\sigma_1\sigma_0 = 11$. The key observation is that the variables which appear in these equations do not appear in any of the later equations $E_2, \ldots, E_{p-1}$, because only $a_i$ can occur more than once for each $i$ and both instances of $a_i$ occur in the same equation.

Our algorithm for enumerating the solutions to equation (17) begins with a loop over all possible choices of values for the variables that occur in $E_0$ and $E_1$. So, this outer loop runs through 4, 8, 8, or 16 possibilities depending on whether we are in case (18), (19), (20), or (21) respectively.

Once values for these variables have been chosen, we compute the carries $k_1$ and $\ell_1$ that are added to the left and right sides of $E_2$. Let

    $\alpha' = k_1 + \sum_{i=2}^{p-1} \alpha_i 2^{i-2}$

and

    $\beta' = \ell_1 + \sum_{i=2}^{p-1} \beta_i 2^{i-2}$.

Then the number of solutions to equation (17) with the chosen values for the variables in $E_0$ and $E_1$ is equal to the number of solutions to

    $\Theta'(a', b') + \alpha' = \Theta'(a', c') + \beta' \bmod 2^{p-2}$    (22)

where $a', b', c'$ each come from $B_{m-2}$, $B_{m-1}$ or $B_m$ depending on whether $\sigma_1\sigma_0 = 00$, $10$, $01$ or $11$. Here $\Theta'$ is the mixing function based on the interleaving $\sigma'$ obtained from $\sigma$ by deleting $\sigma_0$ and $\sigma_1$. Let

    $d = \beta' - \alpha'$                if $\beta' \ge \alpha'$
    $d = 2^{p-2} + \beta' - \alpha'$      if $\beta' < \alpha'$

Then the number of solutions to equation (22) is equal to the number of solutions to

    $\Theta'(a', b') = \Theta'(a', c') + d$

which can be computed using the AC algorithm in $O(m + p)$ steps.

With this generalization we call the algorithm the extended AC Algorithm.
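For small parameters the count produced by the extended AC Algorithm can be checked against a direct enumeration of equation (17). The sketch below is such a reference counter, not the extended algorithm itself; the function names, the parameter `lam` for the block-size exponent $\Lambda$, and the convention that $\sigma$ is a string with the highest position first are assumptions made for illustration.

```python
def interleave(sigma, x, y):
    """Theta(x, y): bits of x fill the 0-positions of sigma, bits of y the 1-positions."""
    out = xi = yi = 0
    for t, s in enumerate(reversed(sigma)):  # sigma is written highest position first
        if s == "0":
            out |= ((x >> xi) & 1) << t
            xi += 1
        else:
            out |= ((y >> yi) & 1) << t
            yi += 1
    return out


def count_block_solutions_17(sigma, alpha, beta, p, lam=2):
    """Count (a, b, c) with floor((Theta(a,b)+alpha)/2^lam) equal to
    floor((Theta(a,c)+beta)/2^lam) modulo 2^(p-lam)."""
    m = sigma.count("0")
    mask = (1 << (p - lam)) - 1
    count = 0
    for a in range(1 << m):
        for b in range(1 << m):
            left = ((interleave(sigma, a, b) + alpha) >> lam) & mask
            for c in range(1 << m):
                right = ((interleave(sigma, a, c) + beta) >> lam) & mask
                if left == right:
                    count += 1
    return count
```

Setting `lam=0` recovers the word-level equation $\Theta(a, b) = \Theta(a, c) + (\beta - \alpha) \bmod 2^p$, which gives a convenient consistency check against the AC count.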


3.2  Computing  the  number  of  solutions  to  equation  (16) 

As in Section 3.1 we will begin by writing out expressions for the digits in the binary expansions of $\Theta(a, b) + \alpha$ and $\Theta(b, c) + \beta$. This gives a system of equations $E_0, E_1, \ldots, E_{p-1}$ where the requirement of equation (16) is that $E_2, \ldots, E_{p-1}$ must be satisfied mod 2 ($E_0$ and $E_1$ need not hold mod 2).

Again, we will look at $E_0$, $E_1$ and find that they have one of four possible forms:

    $a_0 + \alpha_0 = b_0 + \beta_0$
    $a_1 + \alpha_1 + k_0 = b_1 + \beta_1 + \ell_0$    (23)

or

    $a_0 + \alpha_0 = b_0 + \beta_0$
    $b_0 + \alpha_1 + k_0 = c_0 + \beta_1 + \ell_0$    (24)

or

    $b_0 + \alpha_0 = c_0 + \beta_0$
    $a_0 + \alpha_1 + k_0 = b_0 + \beta_1 + \ell_0$    (25)

or

    $b_0 + \alpha_0 = c_0 + \beta_0$
    $b_1 + \alpha_1 + k_0 = c_1 + \beta_1 + \ell_0$    (26)
where $k_0$ is the carry from the left side of $E_0$, and $\ell_0$ is the carry from the right side of $E_0$. Case (23) occurs when $\sigma_1\sigma_0 = 00$, (24) when $\sigma_1\sigma_0 = 10$, (25) when $\sigma_1\sigma_0 = 01$, and (26) when $\sigma_1\sigma_0 = 11$. Note that in (24) and (25) the variables which occur in equations $E_0$ and $E_1$ do not occur in $E_2, \ldots, E_{p-1}$, because only $b_i$ can occur more than once for each $i$ and both instances of $b_0$ occur in $E_0$ and $E_1$. Therefore, we can use the same method we used in Section 3.1 to devise fast algorithms to compute the number of solutions.

Cases (23) and (26) are slightly different because $b_0$ and $b_1$ may occur later in $E_2, E_3, \ldots, E_{p-1}$. In case (23), our algorithm has an outside loop over the four possible choices of values for $a_0$ and $a_1$. For each choice of these values, we compute the carry $k_1$ which is added to the left-hand side of $E_2$. We let

    $\alpha' = 4k_1 + \sum_{i=2}^{p-1} \alpha_i 2^i$

and we apply the AB Algorithm to count the number of solutions to

    $\Theta(a, b) = \Theta(b, c) + d$

where

    $d = \beta - \alpha'$            if $\beta \ge \alpha'$
    $d = 2^p + \beta - \alpha'$      if $\beta < \alpha'$.

This number is equal to the number of solutions to equation (16) in which $a_0$ and $a_1$ have the specified values.

We handle case (26) in a way quite similar to (23). We loop over the four possible choices of values for $c_0$ and $c_1$. For each of these values, we compute the carry $\ell_1$ which is added to the right-hand side of $E_2$ and let

    $\beta' = 4\ell_1 + \sum_{i=2}^{p-1} \beta_i 2^i$

and we apply the AB Algorithm to count the number of solutions to

    $\Theta(a, b) = \Theta(b, c) + d$

where

    $d = \beta' - \alpha$            if $\beta' \ge \alpha$
    $d = 2^p + \beta' - \alpha$      if $\beta' < \alpha$.

This number is equal to the number of solutions to equation (16) in which $c_0$ and $c_1$ have the specified values.

With this generalization we call the algorithm the extended AB Algorithm.

4  Calculating  the  Number  of  Cache  Misses 

In this section we return to the problem of counting cache misses. Recall that we are analyzing the data layout function defined in terms of an $(m, m)$-interleaving $\sigma = \sigma_{2m-1} \ldots \sigma_1 \sigma_0$ by

    $A_{i,k}$ maps to $p_1 + \Theta(i, k)$

    $B_{k,j}$ maps to $p_2 + \Theta(k, j)$

    $C_{i,j}$ maps to $p_3 + \Theta(i, j)$

and that we use the following suggestive notation:

A miss is the number of cache misses when accessing an element of $A$,

A-B miss is the number of cache misses which occur when an element of $A$ is accessed which was in cache but was removed because an element of $B$ took its place,

A-BC miss is the number of cache misses which occur when an element of $A$ is accessed which was previously in cache and such that both an element of $B$ and an element of $C$ have taken its place in cache since it was most recently there, and so on.

Considering  the  inclusion-exclusion  property  of  set  intersections,  the  task  is  to  enumerate  the  following 
types  of  misses: 

A  miss  =  A-A  miss  +  A-B  miss  +  A-C  miss  -  A-AB  miss  -  A-BC  miss  -  A-AC  miss  +  A-ABC  miss 

B  miss  =  B-A  miss  +  B-B  miss  +  B-C  miss  -  B-AB  miss  -  B-BC  miss  -  B-AC  miss  +  B-ABC  miss 

C  miss  =  C-A  miss  +  C-B  miss  +  C-C  miss  -  C-AB  miss  -  C-BC  miss  -  C-AC  miss  +  C-ABC  miss 

Figure  1  shows  this  for  A  miss. 

However, in the special case of matrix multiplication, some misses need not be considered; in particular there are no A-A, A-AB, A-AC, or A-ABC misses because unique elements $A_{i,k}$ are accessed in the two outermost loops only. A method to derive the types of misses that are required in the more general case of programs other than matrix multiplication is the subject of future work. We have also proven in Section 2 that the number of simultaneous solutions is very small.


Figure 1: A miss: inclusion-exclusion property

Please  note  that  we  are  not  including  every  type  of  miss  in  our  analysis,  but  are  including  a  case  that 
represents  each  of  the  key  ideas  involved  in  counting  the  number  of  cache  misses. 

Throughout this section, we will continue to assume that $2m < p$ to simplify the exposition. In the case that $2m > p$, the following changes must be made to the analysis in this section. Each time an iteration point $(i, k, j)$ is counted as a miss, only initial segments of the binary expansions of $i$, $k$ and $j$ are determined.

There are no constraints on how these initial segments are extended to give complete binary expansions of $i$, $k$ and $j$. So each miss enumerated in this section must be multiplied by $2^D$ where $D$ is the total number of undetermined binary digits in $i$, $k$ and $j$ (the number $D$ depends on which kind of miss is being enumerated and so must be determined on a case by case basis).

4.1  Computing  A  miss 

In this subsection we show how to efficiently compute A miss. An array element $A_{i,k}$ will be accessed at the
$n$ iteration points $(i, k, w)$, where $0 \le w \le n-1$. Suppose that we have a cache miss when $A_{i,k}$ is accessed
at the iteration point $(i, k, j)$. As the same element of A is accessed throughout the innermost loop, there
are no A-A misses. Since we are using the lexicographic ordering $i \succ k \succ j$, the iteration point $(i, k, j)$ is
immediately preceded by the iteration point $(i, k, j-1)$ at which the array element $A_{i,k}$ is accessed. Thus
at the iteration point $(i, k, j-1)$ there must be a memory access of an element of B or C which occupies
the same cache set as $A_{i,k}$.
Although possibly negligible, there could also be a small number of contributions to A miss along the
boundary of the innermost loop. The array element $A_{i,k-1}$ is accessed during the $(i, k-1, n-1)$ iteration
step and the array element $A_{i,k}$ is accessed during the following iteration step, $(i, k, 0)$. Suppose $A_{i,k-1}$
and $A_{i,k}$ occupy the same cache word; then a cache miss occurs if there is a memory access of an element
of B or C that maps to the same cache set as $A_{i,k-1}$ at $(i, k-1, n-1)$. It will be the case that the array
element $A_{i,k}$ existed in the cache, but was removed by an access to $B_{k-1,n-1}$ or $C_{i,n-1}$ at iteration step
$(i, k-1, n-1)$.

We  can  now  examine  A-B  miss  and  A-C  miss  separately. 
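
Before specializing to individual miss types, a brute-force simulation can serve as a reference against which analytic counts are compared on small inputs. The following is a minimal sketch under stated assumptions: a direct-mapped cache of $2^{p-2}$ words of four memory locations each, accesses in the order A, B, C within each iteration step, and a hypothetical bit convention for the mixing function (digit $l$ comes from the second argument when $\sigma_l = 1$). The names `simulate`, `mu1`, `mu2`, `mu3` are illustrative. First-touch misses are reported separately as `cold`.

```python
def theta(v, w, sigma):
    """Interleave bits of v and w; sigma[l] == 1 takes digit l from w, else from v."""
    out = 0
    for l, s in enumerate(sigma):
        if s:
            out |= (w & 1) << l
            w >>= 1
        else:
            out |= (v & 1) << l
            v >>= 1
    return out

def simulate(m, p, sigma, mu1, mu2, mu3):
    """Count direct-mapped cache misses for C += A * B in (i, k, j) loop order."""
    n = 1 << m
    words = 1 << (p - 2)
    cache = [None] * words      # cache[slot] = memory block currently held
    seen = set()                # blocks that have been in cache at some point
    misses = {"A": 0, "B": 0, "C": 0, "cold": 0}
    for i in range(n):
        for k in range(n):
            for j in range(n):
                # Access order within one iteration step: A, then B, then C.
                for name, addr in (("A", mu1 + theta(i, k, sigma)),
                                   ("B", mu2 + theta(k, j, sigma)),
                                   ("C", mu3 + theta(i, j, sigma))):
                    block = addr // 4
                    slot = block % words
                    if cache[slot] != block:
                        misses[name if block in seen else "cold"] += 1
                        cache[slot] = block
                        seen.add(block)
    return misses
```

For example, with 2 x 2 row-major matrices (`sigma = [1, 0]`) and base addresses 0, 16, 32 in a four-word cache, all three arrays share one cache word, so every access after the first three misses.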


4.1.1  Computing  A-B  miss 

During the $(i, k, j)$ iteration step we form the product $A_{i,k} \cdot B_{k,j}$ and add it to $C_{i,j}$. When we do so, we
access these three pieces of information in the order $A_{i,k}$ followed by $B_{k,j}$ followed by $C_{i,j}$. So, in order
for this cache miss to contribute to A-B miss, it must be the case that the array element $A_{i,k}$ was removed
from cache at the previous iteration step when the array element $B_{k,j-1}$ was accessed, i.e., $A_{i,k}$ and $B_{k,j-1}$
occupy the same word in cache. This is equivalent to:

$$\left\lfloor \frac{\mu_1 + \Theta(i,k)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_2 + \Theta(k,j-1)}{4} \right\rfloor \pmod{2^{p-2}} \tag{27}$$

where this equation is taken modulo $2^{p-2}$. So A-B miss is equal to the number of solutions $(i, k, j)$ to
equation (27) with $0 \le i \le n-1$, $0 \le k \le n-1$ and $1 \le j \le n-1$. The number of solutions to
equation (27) is computed by the Extended AB Algorithm.
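
For small matrices, the number of solutions to equation (27) can also be obtained by direct enumeration, which is useful as a cross-check. The sketch below is not the Extended AB Algorithm; it is an $O(n^3)$ brute-force count. The bit convention of `theta` (digit $l$ from the second argument when $\sigma_l = 1$) and the names `cache_word` and `count_ab_interior` are assumptions made for illustration.

```python
def theta(v, w, sigma):
    """Interleave bits of v and w; sigma[l] == 1 takes digit l from w, else from v."""
    out = 0
    for l, s in enumerate(sigma):
        if s:
            out |= (w & 1) << l
            w >>= 1
        else:
            out |= (v & 1) << l
            v >>= 1
    return out

def cache_word(addr, p):
    """Cache word index floor(addr / 4) mod 2**(p - 2)."""
    return (addr // 4) % (1 << (p - 2))

def count_ab_interior(mu1, mu2, sigma, m, p):
    """Brute-force count of (i, k, j), 1 <= j <= n - 1, satisfying equation (27)."""
    n = 1 << m
    return sum(
        cache_word(mu1 + theta(i, k, sigma), p)
        == cache_word(mu2 + theta(k, j - 1, sigma), p)
        for i in range(n) for k in range(n) for j in range(1, n)
    )

# 4 x 4 row-major layout: the two low digits come from the column index.
row_major = [1, 1, 0, 0]
aliased = count_ab_interior(0, 16, row_major, m=2, p=4)   # B aliases A in cache
disjoint = count_ab_interior(0, 16, row_major, m=2, p=5)  # cache holds both arrays
```

In the aliased configuration row $i$ of A and row $k$ of B share a cache word exactly when $i = k$, giving $4 \cdot 3 = 12$ solutions; in the disjoint configuration there are none.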

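To count A-B misses along the boundary of the innermost loop, we determine if $A_{i,k-1}$ and $A_{i,k}$ occupy
the same cache word

$$\left\lfloor \frac{\mu_1 + \Theta(i,k-1)}{4} \right\rfloor = \left\lfloor \frac{\mu_1 + \Theta(i,k)}{4} \right\rfloor \tag{28}$$

and if so, we check if an access to the array element $B_{k-1,n-1}$ causes a cache miss

$$\left\lfloor \frac{\mu_1 + \Theta(i,k-1)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_2 + \Theta(k-1,n-1)}{4} \right\rfloor \pmod{2^{p-2}}$$

incrementing the A-B miss count if both equations are satisfied.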


4.1.2  Computing  A-C  miss 


By the same reasoning as above, the number of cache misses that contribute to A-C miss is the number of
solutions to

$$\left\lfloor \frac{\mu_1 + \Theta(i,k)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_3 + \Theta(i,j-1)}{4} \right\rfloor \pmod{2^{p-2}} \tag{29}$$

where this equation is taken modulo $2^{p-2}$ and $i$, $j$, $k$ are constrained to lie in the intervals $0 \le i \le
n-1$, $0 \le k \le n-1$ and $1 \le j \le n-1$. The number of solutions to equation (29) is computed by the
Extended AC Algorithm.

To count the contributions to A-C miss along the boundary of the innermost loop, we check if $A_{i,k-1}$
and $A_{i,k}$ occupy the same cache word exactly as in equation (28), and if so we determine if an access to the
array element $C_{i,n-1}$ causes a cache miss

$$\left\lfloor \frac{\mu_1 + \Theta(i,k-1)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_3 + \Theta(i,n-1)}{4} \right\rfloor \pmod{2^{p-2}}$$

incrementing the A-C miss count if both equations are satisfied.

4.1.3  Computing  A-BC  miss 

We will count zero A-BC misses. The conditions in Section 2.3 can be checked in $O(p)$ steps to determine
whether this count is accurate. As proved in Section 2.3, even if there are instances of such misses, their
number is small - less than $2^{-\gamma}$ of the total number of misses, where $\gamma$ is the number of ones in the set
$\{\sigma_2, \sigma_3, \ldots, \sigma_{p-1}\}$. In fact, on the basis of this result, we are setting all terms requiring the simultaneous
solving of equations (e.g., B-AB miss, B-BC miss, B-AC miss, B-ABC miss, C-AB miss, C-BC miss,
C-AC miss, C-ABC miss) to zero.

4.2  Computing  C  miss 

The quantity C miss counts the number of iteration points $(i, k, j)$ with $k > 0$ such that the matrix element
$C[i, k, j]$ is not in cache thereby causing a miss. As a first step, we will determine $L[i, k, j]$ which denotes
the most recent iteration step, prior to $(i, k, j)$, at which $C[i, k, j]$ was in cache. Note that $L[i, k, j]$ is the most
recent iteration step when an element of C was accessed that occupies the same cache word as $C[i, k, j]$. If
we write $L[i, k, j] = (i', k', j')$ this is equivalent to:

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor = \left\lfloor \frac{\mu_3 + \Theta(i',j')}{4} \right\rfloor \tag{30}$$


4.2.1  Computing  C-A  miss 

The solution to equation (30) depends on the form of $\Theta$ and so at this point the analysis must break into
cases. There are four cases to consider depending on whether $\sigma_1\sigma_0 = 00$, $01$, $10$ or $11$. We will write out
details in two of the cases which represent the technical problems that come up in the other two cases. The
details of the remaining two cases are left to the reader.

Case 1: $\sigma_1\sigma_0 = 00$.

In this case, the four elements of C which occupy the same cache word as $C[i, k, j] = C_{i,j}$ are usually
$C_{i,j-u}$, $C_{i,j-u+1}$, $C_{i,j-u+2}$, $C_{i,j-u+3}$ where $\mu_3 + \Theta(i,j) \equiv u \bmod 4$. The modifier 'usually' refers to the
observation that not all of these elements of C might exist in the extreme cases where $j < u$ or $j - u + 3 \ge n$.
But as long as $u > 0$ and $j > 0$, $C_{i,j-1}$ is in the same cache word as $C_{i,j}$. In this case, $C[i, k, j]$ is brought
into cache at the preceding iteration step $(i, k, j-1)$ and so $L(i,k,j) = (i, k, j-1)$.

If $j = 0$ or $u = 0$ then $L(i,k,j) = (i, k-1, j-u+3)$ unless $j - u + 3 \ge n$. In that case ($u = 0$ and
$j - u + 3 \ge n$), $L(i,k,j) = (i, k-1, n-1)$. To summarize: if $u \equiv \mu_3 + \Theta(i,j) \bmod 4$, then

$$L(i,k,j) = \begin{cases} (i, k, j-1) & \text{if } j > 0 \text{ and } u > 0 \\ (i, k-1, j-u+3) & \text{if } j = 0 \text{ or } \{u = 0 \text{ and } j-u+3 < n\} \\ (i, k-1, n-1) & \text{if } u = 0 \text{ and } j-u+3 \ge n. \end{cases}$$
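
The piecewise definition of $L(i,k,j)$ in Case 1 translates directly into a small function. This is a sketch only: the name `last_in_cache_case1` is hypothetical, `u` is assumed to be supplied by the caller as the address of $C_{i,j}$ mod 4, $n \ge 4$ is assumed, and the third branch is implemented by clamping the column index to $n - 1$.

```python
def last_in_cache_case1(i, k, j, u, n):
    """Piecewise L(i, k, j) for Case 1, where u is the address of C[i][j] mod 4."""
    if j > 0 and u > 0:
        # C[i][j-1] shares the cache word, so the preceding step loaded it.
        return (i, k, j - 1)
    # Otherwise the word was last touched during the previous pass over row i
    # of C; clamp to the last column when j - u + 3 runs past it.
    return (i, k - 1, min(j - u + 3, n - 1))
```

For instance, with $n = 8$ the point $(2, 3, 5)$ with $u = 2$ maps to $(2, 3, 4)$, while $(2, 3, 6)$ with $u = 0$ maps to $(2, 2, 7)$.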

Now C-A miss is the number of pairs of iteration points $(i, k, j)$, $(x, z, y)$ such that

$$L(i,k,j) < (x,z,y) \le (i,k,j) \tag{31}$$

and

$$\left\lfloor \frac{\mu_1 + \Theta(x,z)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \pmod{2^{p-2}} \tag{32}$$

To clarify the connection between C-A miss, equation (31), and equation (32) note that equation (32) states
that $A[x, z, y]$ and $C[i, k, j]$ occupy the same cache word and equation (31) states that iteration step $(x, z, y)$
occurs sometime between the iteration step $(i, k, j)$ and the previous iteration step when $C[i, k, j]$ was
brought into cache.

We now break our analysis into two cases depending on the exact form of $L[i, k, j]$. If $L[i, k, j] =
(i, k, j-1)$ then we must have $(x, z, y) = (i, k, j)$. Also, if $u = 0$ and $j + 3 \ge n$ so that $L[i, k, j] =
(i, k-1, n-1)$ then equation (31) becomes $(i, k-1, n-1) < (x, z, y) \le (i, k, j)$. This cannot be satisfied
with $z = k-1$ because we would then need $n - 1 < y$. So we must have $z = k$ and $y = j$. This is a second
instance in which $(x, z, y)$ must be equal to $(i, k, j)$. In this case, equation (32) states:

$$\left\lfloor \frac{\mu_1 + \Theta(i,k)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \pmod{2^{p-2}}$$

Solutions to this equation are enumerated by the Extended AC Algorithm.

If $j = 0$ or $u = 0$ and $j - u + 3 \le n - 1$ then equation (31) states that $(i, k-1, j-u+3) < (x, z, y) \le
(i, k, j)$. We deduce that $x = i$ and that $z$ is equal to either $k-1$ or $k$. Also, in this case we cannot satisfy
the inequality $j - u + 3 < y \le j$ so we must have $z = k - 1$. Thus the contribution to C-A miss made in
this case is the number of solutions to equation (32) which is:

$$\left\lfloor \frac{\mu_1 + \Theta(i,k-1)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \pmod{2^{p-2}}$$

This is equivalent to enumerating solutions to

$$\left\lfloor \frac{\mu_1 + \Theta(i,k')}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \pmod{2^{p-2}}$$

where $0 \le i \le n-1$, $0 \le k' \le n-2$, $0 \le j \le n-1$. Solutions to this equation are enumerated by the
Extended AC Algorithm.

This completes Case 1 in our analysis of C-A miss.


Case 2: $\sigma_1\sigma_0 = 10$.

The fundamental difference between the analysis in this case and the analysis in Case 1 is the relationship
between cache words and the arrays A, B, C. In particular, the elements of C that occupy the same cache
word as $C_{i,j}$ are

$$C_{i-v,j-u}, \quad C_{i-v+1,j-u}, \quad C_{i-v,j-u+1}, \quad C_{i-v+1,j-u+1} \tag{33}$$

where

$$\mu_3 + \Theta(i,j) \equiv v \bmod 2$$

and

$$\frac{\mu_3 + \Theta(i,j) - v}{2} \equiv u \bmod 2$$

The analysis now parallels the analysis in Case 1 but with changes in some details to reflect the cache word
structure given in equation (33).

If $j > 0$ and $u > 0$ then $L[i, k, j] = (i, k, j-1)$ and we proceed as in Case 1. If $u = 0$ and $j = n - 1$
then $L[i, k, j] = (i, k-1, n-1) = (i, k-1, j)$. In both these cases, if a cache miss is caused by the access
of $A_{x,z}$ removing $C_{i,j}$ at iteration step $(x, z, y)$, where $L[i,k,j] < (x, z, y) \le (i, k, j)$, then we must have
$(x, z, y) = (i, k, j)$. These instances are enumerated as in Case 1 by the Extended AC Algorithm.

If $j = 0$ or if $u = 0$ and $j < n - 1$ then $L(i, k, j) = (i, k-1, j-u+1)$. In this case, $(x, z, y)$ must
equal $(i, k-1, j-u+1)$ and we enumerate these instances as in Case 1. This completes the computation
of C-A miss in Case 2.

The computation of C-A miss in the remaining two cases is similar.

Note that C-C miss can be handled in a manner similar to C-A miss. There is the same consideration of
the most recent iteration step at which an element of C was accessed that occupies the same cache word as
$C[i, k, j]$, and the analysis breaks into the same four cases depending on $\sigma_1\sigma_0$. The key difference is that in
this case, an access to $C_{x,y}$ interferes with an access to $C_{i,j}$ where $x$, $y$ are as in equation (31).

4.2.2  Computing  C-B  miss 

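To compute C-B miss, we need to compute the number of pairs of triples $(i, k, j)$, $(x, z, y)$ which satisfy
equation (31) such that $C[i, k, j] = C_{i,j}$ and $B[x, z, y] = B_{z,y}$ occupy the same cache block. This latter
condition is equivalent to:

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_2 + \Theta(z,y)}{4} \right\rfloor \pmod{2^{p-2}} \tag{34}$$

As in the previous subsection, if $\lfloor (\mu_3 + \Theta(i,j-1))/4 \rfloor = \lfloor (\mu_3 + \Theta(i,j))/4 \rfloor$, then equation (31) implies that
$(x, z, y) = (i, k, j)$. In that case, equation (34) is equivalent to:

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_2 + \Theta(k,j)}{4} \right\rfloor \pmod{2^{p-2}} \tag{35}$$

Let $\bar{\sigma}$ be the interleaving obtained from $\sigma$ by interchanging 0's and 1's and let $\bar{\Theta}$ denote the mixing function
determined by $\bar{\sigma}$. Note that for any pair of non-negative integers $v$, $w$:

$$\bar{\Theta}(v, w) = \Theta(w, v).$$

Thus, we can re-write equation (35) as

$$\left\lfloor \frac{\mu_3 + \bar{\Theta}(j,i)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_2 + \bar{\Theta}(j,k)}{4} \right\rfloor \pmod{2^{p-2}} \tag{36}$$

The Extended AC Algorithm counts solutions to equation (36) which gives us a fast algorithm to count
contributions to C-B miss that arise in the instances where $(x, z, y) = (i, k, j)$.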
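
The identity relating the mixing function of an interleaving to that of its complement (swapping the two arguments) is easy to confirm numerically. The sketch below assumes that digit $l$ of the mixed value is taken from the second argument when $\sigma_l = 1$ and from the first argument otherwise; `theta`, `complement` and the sample interleaving are illustrative names and values, not taken from the paper.

```python
def theta(v, w, sigma):
    """Interleave bits of v and w; sigma[l] == 1 takes digit l from w, else from v."""
    out = 0
    for l, s in enumerate(sigma):
        if s:
            out |= (w & 1) << l
            w >>= 1
        else:
            out |= (v & 1) << l
            v >>= 1
    return out

def complement(sigma):
    """Interchange 0s and 1s in the interleaving."""
    return [1 - s for s in sigma]

sigma = [1, 0, 0, 1, 1, 0]      # hypothetical interleaving for 8 x 8 matrices
sigma_bar = complement(sigma)
n = 8

# Mixing with the complemented interleaving equals mixing with swapped arguments.
identity_holds = all(
    theta(v, w, sigma_bar) == theta(w, v, sigma)
    for v in range(n) for w in range(n)
)
```

This is what allows an equation whose two sides share their second index to be handed to an algorithm written for equations whose two sides share their first index.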

The remaining contributions to C-B miss come from solutions to equation (34) in the cases where either
$j = 0$ or

$$\left\lfloor \frac{\mu_3 + \Theta(i,j-1)}{4} \right\rfloor \ne \left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor$$

The analyses of these two cases are somewhat different and so we do them separately.

Consider the case where

$$\left\lfloor \frac{\mu_3 + \Theta(i,j-1)}{4} \right\rfloor \ne \left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor$$

This can occur in one of three different ways. If $\sigma_0 = 1$ then this condition is equivalent to $\mu_3 + \Theta(i,j) \equiv
0 \bmod 4$. If $\sigma_1\sigma_0 = 10$ then this condition is equivalent to $\mu_3 + \Theta(i,j) \equiv 0, 1 \bmod 4$. Lastly, if
$\sigma_1\sigma_0 = 00$ then this condition always holds.

In this case, we have $L(i, k, j) = (i, k-1, j+1)$ and so equation (31) becomes

$$(i, k-1, j+1) \le (x, z, y) \le (i, k, j)$$

At first glance, the enumeration of solutions to equation (34) appears to be problematic. Although we can
deduce that $z$ is either $k-1$ or $k$, we have very little control on $j$. So equation (34) contains four variables
that are essentially independent. After some simplification of the problem, we will see that this is in fact an
advantage and makes the enumeration of solutions particularly easy.

To count solutions to equation (34) we first loop over all possible values for the first two digits in the
binary expansions of $\Theta(i,j)$ and $\Theta(z,y)$. This will involve specifying the first two digits of $i$, or the first
digit of $i$ and the first digit of $j$, or the first two digits of $j$ depending on the values of $\sigma_1\sigma_0$. Let $i'$, $j'$ be
the remaining, unspecified digits of $i$ and $j$. Define $z'$ and $y'$ similarly. Also, define $\Theta'$ to be the mixing
function associated with $\sigma' = \ldots \sigma_3\sigma_2$.

Having specified the first two digits of $\Theta(i,j)$ and $\Theta(z,y)$ we next perform the following steps:

1) Check whether $\lfloor (\mu_3 + \Theta(i,j-1))/4 \rfloor = \lfloor (\mu_3 + \Theta(i,j))/4 \rfloor$. If so, go to the next step in the loop (here loop refers to the
outermost loop whose steps are indexed by the choices for possible first two digits of $\Theta(i,j)$ and $\Theta(z,y)$).
Otherwise, continue to step 2).

2) Let $\mu'_3$ be $\lfloor \mu_3/4 \rfloor + \epsilon_3$ where $\epsilon_3$ is the binary carry from the first to the second binary digits in $\mu_3 + \Theta(i,j)$.

3) Let $\mu'_2$ be $\lfloor \mu_2/4 \rfloor + \epsilon_2$ where $\epsilon_2$ is the binary carry from the first to the second binary digits in $\mu_2 + \Theta(z,y)$.

The number of solutions to equation (34), given the specified digits in $\Theta(i,j)$ and $\Theta(z,y)$, is equal to the
number of solutions to

$$\mu'_3 + \Theta'(i',j') \equiv \mu'_2 + \Theta'(z',y') \pmod{2^{p-2}} \tag{37}$$

Define $d$ by:

$$d = \begin{cases} \mu'_3 - \mu'_2 & \text{if } \mu'_3 \ge \mu'_2 \\ 2^{p-2} + \mu'_3 - \mu'_2 & \text{if } \mu'_3 < \mu'_2 \end{cases}$$

Then the number of solutions to equation (37) is equal to the number of solutions to equation (38):

$$d + \Theta'(i',j') \equiv \Theta'(z',y') \pmod{2^{p-2}} \tag{38}$$

Enumerating solutions to equation (38) is straightforward. Let $d'$ consist of the first $2m - 2$ binary digits of
$d$. First, examine the remaining binary digits, $d_l$, for $2m - 2 \le l \le p - 3$. Unless these remaining binary
digits are identical, there are no solutions to equation (38). If they are identical and equal to $e$, then the
number of solutions to equation (38) is equal to the number of solutions to:

$$d' + \Theta'(i',j') \equiv \Theta'(z',y') \pmod{2^{2m-2}} \tag{39}$$

for which the terminal carry on the left hand side of equation (39) is $e$. The key observation is that the
number of solutions to equation (39) is equal to the number of pairs $(i', j')$ which result in terminal carry $e$
- once such a pair has been specified there is a unique choice of $z'$, $y'$ which satisfy equation (39).

In our design of the AC Algorithm (Section 2), we exactly determined the number of pairs $i'$, $j'$ which
give terminal carry $e$ on the left-hand side of equation (39). This number is:

$$\begin{cases} 2^{2m-2} - d' & \text{if } e = 0 \\ d' & \text{if } e = 1. \end{cases}$$
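
The count of index pairs producing a given terminal carry can be confirmed by exhaustive enumeration for small sizes. The sketch below assumes a bit convention for the mixing function (digit $l$ from the second argument when the interleaving digit is 1) and uses hypothetical names; it counts pairs for which adding the offset to the mixed value overflows $2m - 2$ binary digits.

```python
def theta(v, w, sigma):
    """Interleave bits of v and w; sigma[l] == 1 takes digit l from w, else from v."""
    out = 0
    for l, s in enumerate(sigma):
        if s:
            out |= (w & 1) << l
            w >>= 1
        else:
            out |= (v & 1) << l
            v >>= 1
    return out

def pairs_with_carry(d_prime, sigma_prime, e):
    """Count pairs (i', j') for which d' + theta'(i', j') has terminal carry e."""
    bits = len(sigma_prime)        # 2m - 2 binary digits
    side = 1 << (bits // 2)        # i' and j' each have m - 1 digits
    return sum(
        ((d_prime + theta(i, j, sigma_prime)) >> bits) == e
        for i in range(side) for j in range(side)
    )

sigma_prime = [1, 0, 0, 1]         # hypothetical reduced interleaving, m - 1 = 2
carry_counts = [(pairs_with_carry(d, sigma_prime, 0),
                 pairs_with_carry(d, sigma_prime, 1)) for d in range(16)]
```

Because the mixing function is a bijection from index pairs onto the integers below $2^{2m-2}$, exactly $d'$ of its values overflow when $d'$ is added, which is why the count does not depend on the interleaving.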

This finishes our enumeration of solutions to equation (34) in the case that $\lfloor (\mu_3 + \Theta(i,j-1))/4 \rfloor \ne \lfloor (\mu_3 + \Theta(i,j))/4 \rfloor$. To
evaluate the complexity of this enumeration algorithm: there is an outer loop through the sixteen possible
choices for the first two binary digits of $\Theta(i,j)$ and $\Theta(z,y)$. Within that loop, we need to determine $d$ and
check whether the binary digits of $d$ are consistent between $2m - 2$ and $p - 3$. That takes $O(p)$ operations.
Then we need to compute $d'$ which can be done in $O(m)$ operations. Thus the total complexity in this case
is $O(p)$.

To complete the calculation of C-B miss we need to consider the case where $j = 0$. By the same series
of reductions we used in the previous case, this reduces to enumerating solutions to

$$d' + \Theta'(i', 0) \equiv \Theta'(z', y') \pmod{2^{2m-2}} \tag{40}$$

As before, equation (40) has no solutions unless the binary digits $d_l$, $l = p - 3, \ldots, 2m - 1, 2m - 2$ are
identical. If they are identical, let $e \in \{0, 1\}$ represent their common value. In that case, the number of
solutions to equation (40) is equal to the number of $i'$ for which $d' + \Theta'(i', 0)$ has terminal carry $e$.

To efficiently compute this number, first scan the left hand sides of the digit-by-digit equations in equation (40) from
bottom to top. Consider those left hand sides which have the form $d'_l + 0 + \text{carry}$ (where 0 comes from the
$j' = 0$ component in $\Theta'(i', 0)$, i.e., $\sigma_{l+2} = 1$). If $d'_l = 1$ then erase that equation as you know that whatever
carry comes in, the same carry will go out. If $d'_l = 0$ then stop your scan. You know that the outgoing carry will have to
be 0. So we can start constructing solutions from the next equation on without regard to any earlier binary
digits of $i'$. To this end, let $r$ be the number of digits of $i'$ which occur in equations in equation (40) that come before
your stopping point. Let $d''$ be the digits of $d'$ that remain on the left hand side of equation (40) above the
$l$-th digit (we use the word remain because we have erased some equations at earlier steps in the scan). Let
$i''$ be the corresponding digits of $i'$ and let $\lambda$ be the number of digits of $d''$. Then the number of solutions to
equation (40) is $2^r$ times the number of $i''$ such that

$$\begin{cases} d'' + i'' < 2^{\lambda} & \text{if } e = 0 \\ d'' + i'' \ge 2^{\lambda} & \text{if } e = 1 \end{cases}$$

This number is computed as before which completes the $j = 0$ case and therefore completes our computation
of C-B miss.

Note that the algorithm described here to handle the case $j = 0$ has complexity $O(p)$. Also note that for
each choice of initial two digits in $\Theta(i,j)$, the solutions to equation (34) where $j = 0$ are either contained
in, or else disjoint from, the solutions where $\lfloor (\mu_3 + \Theta(i,j-1))/4 \rfloor \ne \lfloor (\mu_3 + \Theta(i,j))/4 \rfloor$. It is trivial to determine whether
there is inclusion by examination of the initial two digits chosen, which takes care of any overcounting that
results from iteration steps $(i, k, j)$ that are enumerated in both of these cases.

4.3  Computing  B  miss 

In this subsection we finish the analysis of misses by computing B miss. The quantity B miss counts the
number of iteration points $(i, k, j)$ at which the matrix element $B_{k,j}$ is not in cache, having been there
previously.

If $B_{k,j}$ is in the same cache block as $B_{k,j-1}$ (Note: This case will arise if $\sigma_0 = 1$ and $\mu_2 + \Theta(k,j) \equiv
1, 2, 3 \bmod 4$ or if $\sigma_1\sigma_0 = 10$ and $\mu_2 + \Theta(k,j) \equiv 2, 3 \bmod 4$), any collisions that forced $B_{k,j}$ out of
cache must have occurred at iteration step $(i, k, j-1)$ if the collision occurs with an element of C or at step
$(i, k, j)$ if the collision occurs with an element of A. Using arguments that are similar to those in previous
cases, we see that these instances are enumerated by the Extended AB Algorithm and the Extended AC
Algorithm. So we only need to examine the cases where $B_{k,j}$ and $B_{k,j-1}$ occupy different cache words.
This analysis depends on the form of $\sigma$ and so we need to consider different cases.

4.3.1  Computing  B-A  miss 

We want to add one to B-A miss if there is an iteration step $(x, z, y)$ with

$$(i, k-1, j+1) \le (x, z, y) \le (i, k, j)$$

for which the matrix entry $A_{x,z}$ occupies the same cache word as $B_{k,j}$. Note that $x = i$ and that $z = k - 1$
or $z = k$. We are free to choose $y$ as long as we choose $y \ge j + 1$ when $z = k - 1$ and $y \le j$ when $z = k$.

Case 1: $B_{k,j}$ is in the same cache word as $B_{k-1,j}$ but not in the same cache word as $B_{k,j-1}$. (Note: This case
will arise if $\sigma_1\sigma_0 = 10$ and $\mu_2 + \Theta(k,j) \equiv 1 \bmod 4$ or if $\sigma_1\sigma_0 = 00$ and $\mu_2 + \Theta(k,j) \equiv 1, 2, 3 \bmod 4$).

In this case, $B_{k,j}$ is brought into cache at iteration step $(i, k, j-1)$. The enumeration of misses in this
case is similar to previous cases.

Enumerating $(i, k, j)$ which satisfy these conditions is equivalent to counting $(i, k, j)$ which are solutions
to:

$$\left\lfloor \frac{\mu_2 + \Theta(k,j)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_1 + \Theta(i,k)}{4} \right\rfloor \pmod{2^{p-2}} \tag{41}$$

(we can then choose any $y \le j$) OR which are solutions to:

$$\left\lfloor \frac{\mu_2 + \Theta(k,j)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_1 + \Theta(i,k-1)}{4} \right\rfloor \pmod{2^{p-2}} \tag{42}$$

with $j < n - 1$ (we can then choose any $y \ge j + 1$). In doing so, we must be careful to count those
$(i, k, j)$ which satisfy both sets of equations only once. Fortunately, because we are in Case 1, we know that
$\lfloor (\mu_2 + \Theta(k,j))/4 \rfloor = \lfloor (\mu_2 + \Theta(k-1,j))/4 \rfloor$ so we can replace equation (42) with an identical equation which has $k - 1$
in place of $k$ on the left hand side. When this is done, equation (42) is identical to equation (41) with the
variable $k' = k - 1$ in place of $k$. So, to count $(i, k, j)$ which are solutions to at least one of equation (41)
or equation (42), we can just count solutions to equation (41). The number of solutions to equation (41) can
be computed as in previous cases.


Case 2: $B_{k,j}$ is in a different cache block from both $B_{k-1,j}$ and $B_{k,j-1}$. (Note: This arises exactly when
$\mu_2 + \Theta(k,j) \equiv 0 \bmod 4$).

In this case, the most recent previous access of $B_{k,j}$ was at iteration step

$(i-1, k+3, j)$ if $\sigma_1\sigma_0 = 00$

$(i-1, k+1, j+1)$ if $\sigma_1\sigma_0 = 01$ or $10$

$(i-1, k, j+3)$ if $\sigma_1\sigma_0 = 11$.

We will consider just one of these possibilities - the others are handled in similar ways. Assume that
$\sigma_1\sigma_0 = 10$.

The iteration step $(i, k, j)$ contributes to B-A miss if there is an iteration step $(x, z, y)$ with

$$(i-1, k+1, j+1) < (x, z, y) \le (i, k, j)$$

satisfying

$$\left\lfloor \frac{\mu_2 + \Theta(k,j)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_1 + \Theta(x,z)}{4} \right\rfloor \pmod{2^{p-2}}$$

This is similar to the exceptional case for C-B miss where $\mu_3 + \Theta(i,j) \equiv 0 \bmod 4$. We enumerate
solutions in a similar way.

Note that B-B miss and B-C miss are computed in ways quite similar to B-A miss, dividing the analysis
in the same Case 1 and Case 2. For B-B miss, we are counting triples $(i, k, j)$ such that the array element
$B_{k,j}$ was in cache but was removed because array element $B_{z,y}$ took its place in cache. For B-C miss, we
are counting triples $(i, k, j)$ such that $B_{k,j}$ was displaced from cache by $C_{x,y}$.


This completes the enumeration of cache misses.


5 A-way Associative Cache


In  this  section,  we  indicate  the  changes  needed  to  generalize  our  enumeration  of  cache  misses  from  direct 
mapped cache to the case of an A-way associative cache (Figure 2). In this case, memory location M is
mapped  to  the  cache  set  A  =  J  mod  p  ,  where  p  =  —  is  the  number  of  cache  sets.  A  contains  A 
cache blocks (each consisting of four memory locations, as explained in Section 3) that are filled according
to  either  the  first-in,  first-out  (FIFO)  protocol,  the  least  recently  used  (LRU)  protocol  or  random  fill  [6].  LRU 
gives  the  best  performance  but  is  usually  the  most  difficult  to  describe.  We  will  show  the  analysis  given  the 
LRU  protocol. 


Figure 2: (a) cache word (b) cache block, size 4 (c) cache set, A=3

Assuming the LRU protocol, a cache block is evicted on a cache miss when its last access lies furthest
back in time. I.e., in our framework, we must enumerate instances where a matrix element X is accessed
and brought into its cache set, and where at least A times, since the previous access of X, different matrix
elements are accessed that are not in cache and which are mapped to the same cache set. We will use
the term collisions for such instances and call these instances collisions with X. For more specificity, we
will characterize collisions according to what kinds of array elements are involved. So, we will talk about
C-A collisions meaning instances when an array element from A is brought into the same cache set as an
element of C between consecutive accesses of that element of C. The relationship between collisions and
cache misses is straightforward - when we access a matrix element X, we will have a cache miss if there
have been at least A collisions with X since the previous access. Thus, misses constitute a subset of
collisions.
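
The relationship between collisions and misses under LRU can be illustrated with a single cache set. The sketch below is a generic LRU model written for illustration (the class `LRUSet` and helper `reaccess_hits` are hypothetical names): a block survives in a set with a given number of ways until that many distinct other blocks have been brought into the same set since its last access.

```python
from collections import OrderedDict

class LRUSet:
    """One cache set holding up to `ways` blocks under LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # least recently used first

    def access(self, block):
        """Touch `block`; return True on a hit and False on a miss."""
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[block] = None
        return False

def reaccess_hits(ways, distinct_between):
    """Access X, then `distinct_between` other blocks in the same set, then X again."""
    cache_set = LRUSet(ways)
    cache_set.access("X")
    for b in range(distinct_between):
        cache_set.access(("other", b))
    return cache_set.access("X")
```

With three ways, two intervening blocks leave X resident, while three intervening blocks evict it, so the second access to X misses.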

In the following, we will show the analysis for C collisions. According to our strategy, we will enumerate
iteration steps $(i, k, j)$ according to the number of collisions of type C-A, C-B, C-C, C-AB, C-AC,
C-BC, and C-ABC that have occurred between the access of $C_{i,j}$ at iteration step $(i, k-1, j+r)$ and
iteration step $(i, k, j)$.

The considerations that go into enumeration of collisions will be very similar to the considerations that
went into the enumeration of cache misses in the direct mapped case, but the general enumeration framework
will be somewhat more challenging. Instead of dividing the analysis according to which matrix is being
accessed, we will divide the analysis according to the number of iteration steps since the most recent access
of the matrix element under consideration.

Consider the situation where we access a matrix element X at iteration step $(i, k, j)$. At this point, we
will not yet specify which of the arrays A, B or C that X comes from.

Case 6.1: The matrix element X was last accessed at iteration step $(i, k, j-1)$.

In this case, the access of X at iteration step $(i, k, j)$ can only cause a cache miss when the associativity is 1 or 2. This case
can be handled using arguments from the previous section. Note that this case includes all A misses and a
subset of the C misses.


Case 6.2: The most recent access of X (prior to iteration step $(i, k, j)$) was at iteration step $(i, k-1, j+r)$
for some $r$.

At this point, it is necessary to consider which of the arrays X comes from.


Consider the problem of determining whether there is a cache miss with an A-way associative cache
when $C_{i,j}$ is accessed at iteration step $(i, k, j)$. For the next few paragraphs, it is important to keep in mind
that $i$, $k$, $j$ are fixed. We are going to try to find conditions on $i$, $k$, $j$ under which there will be at least A
collisions with $C_{i,j}$ between iteration steps $(i, k-1, j+r)$ and $(i, k, j)$.


For C-A collisions, let $\alpha$ be the number of distinct $A_{x,z}$ which occupy the same cache set as $C_{i,j}$ and
which are accessed between steps $(i, k-1, j+r)$ and $(i, k, j)$. By that latter condition, we must have $x = i$
and $z \in \{k-1, k\}$. So, $\alpha = 0, 1, 2$ depending on whether neither, one of, or both of $u = k - 1$ and $u = k$
give solutions to:


[«  +  0MJ  =  L/^1  +  e(,:,“)j  mod  2f 


(43) 


For C-B collisions, let $\beta$ be the number of distinct $B_{z,y}$ which occupy the same cache set as $C_{i,j}$ and
which are accessed between steps $(i, k-1, j+r)$ and $(i, k, j)$. By that latter condition, we must have $x = i$
and $z \in \{k-1, k\}$. To occupy the same cache set as $C_{i,j}$ we must have


[w  +  e|,:'jlj  =  j  mod  2' 


(44) 


Finally for C-C collisions, let $\gamma$ be the number of distinct $C_{x,y}$ which occupy the same cache set as $C_{i,j}$
and which are accessed between steps $(i, k-1, j+r)$ and $(i, k, j)$. By the latter condition, we must have
$x = i$. To occupy the same cache set as $C_{i,j}$ we must have


[iifMj  =  [<‘3  +  0(».!<)j  mod  2f 


(45) 


Since  x  =  i,  equation  (44)  is  equivalent  to 


l/‘3  +  e(;-j)j  =  l«±AA)J  mod  2* 


(46) 


Before diving into details, it is worth discussing the broad outlines of the enumeration method that we
follow. Our immediate goal is to enumerate C misses with an A-way associative cache. More precisely,
we want to count iteration steps $(i, k, j)$ where the most recent prior access of $C_{i,j}$ was at iteration step
$(i, k-1, j+r)$ and where there have been at least A distinct matrix elements X inserted into the same
cache set as $C_{i,j}$ between iteration steps $(i, k-1, j+r)$ and $(i, k, j)$.

The solutions to equation (43) characterize those $X = A_{x,z}$ which collide with $C_{i,j}$, the solutions to
equation (44) characterize those $B_{z,y}$ which collide with $C_{i,j}$ and the solutions to equation (46) characterize
those $C_{x,y}$ which collide with $C_{i,j}$, all collisions occurring between iteration steps $(i, k-1, j+r)$ and $(i, k, j)$.

For any fixed $(i, k, j)$ there can be at most two C-A collisions because solutions to equation (43) determine
$(i, k, j)$. The collisions will occur when $A_{i,u}$ is inserted into the same cache set as $C_{i,j}$. Since $u$ must
be either $k$ or $k - 1$, there can be at most two such collisions. There are two collisions if $(i, k-1, j)$ and
$(i, k, j)$ are simultaneously solutions to equation (43) and so we will have to enumerate such instances.

The treatment of C-B collisions is more complicated. The number of collisions for a fixed $(i, k, j)$ is the number of solutions to equation (44) with $z = k$ and $y < j$ plus the number of solutions to equation (44) with $z = k-1$ and $y > j$. If $2m \le p$ then $z$ and $y$ are completely determined by equation (44) and so the total number of C-B collisions for a fixed $(i, k, j)$ will be at most two. However, if $2m > p$ then one must consider the number of ways to extend the mod $2^p$ solutions $\bar{j}$, $\bar{z}$, $\bar{y}$ of equation (44). There is an unrestricted choice of extensions for $\bar{z}$, thus creating $2^{2 E_0}$ distinct choices for $k$. For each choice of extension of $\bar{j}$ one must count the extensions of $\bar{y}$ such that $y < j$ if $z = k$ or $y > j$ if $z = k-1$. The number of such extensions of $\bar{y}$ will dictate the number of C-B collisions for this particular $(i, k, j)$. A crucial consideration is whether $(i, k, j)$ and $(i, k-1, j)$ are simultaneously solutions to equation (44). If so, any extension of $\bar{y}$ will create a C-B collision without any consideration of how $y$ compares to $j$. So, we will need an algorithm to determine the number of iteration steps $(i, k, j)$ for which there are simultaneous solutions to equation (44) with $z = k$ and $z = k-1$.

Considerations of extensions also come into play when counting C-C misses. In this case, $k$ is arbitrary, so whatever enumeration of collisions we do for a fixed $i$, $j$ will hold for all iteration steps of the form $(i, k, j)$. Again, equation (46) determines $y$ (in terms of $j$) mod $2^p$. We can then extend $\bar{j}$, $\bar{y}$ without restriction in digits $p$ to $2m - 1$. Different extensions of $\bar{j}$ give different iteration steps $(i, k, j)$ (again, $k$ is free to take on any value). However, different extensions of $\bar{y}$ give multiple C-C collisions at the iteration step $(i, k, j)$.

This gives a framework for the enumeration. The method will utilize the technology we have already developed, with a couple of simple extensions, to enumerate solutions to equation (43), equation (44), and equation (46). If $2m \le p$ then this analysis closely follows the analysis of cache misses in the direct-mapped case done in Section 4.

So we will focus on the case $2m > p$, where there are considerations not previously encountered. In this case, we must consider extensions of solutions to binary digits $p$ and beyond. These extensions sometimes expand the number of iteration points $(i, k, j)$ and sometimes expand the number of collisions per iteration point.

When  this  analysis  is  complete,  we  will  have  counted  C-A,  C-B  and  C-C  collisions  separately.  We  must 
then  indicate  how  to  count  iteration  points  where  there  are  simultaneous  C-A,  C-B  and  C-C  collisions.  We 
begin  with  two  technical  lemmas  that  will  be  key  to  our  analysis. 


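Lemma 5.1 There is an algorithm, ALGORITHM D1, that counts the number of triples $(i, z, j)$ such that

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_1 + \Theta(i,z)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_1 + \Theta(i,z+1)}{4} \right\rfloor \pmod{2^p} \tag{47}$$

Moreover, this algorithm has the same complexity as the AC ALGORITHM.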


The algorithm loops over the possible first two digits of $\Theta(i, j)$ and $\Theta(i, z)$. For each such choice, the algorithm determines whether there is a contribution to the total count and proceeds to the next step in the loop. The complete proof is shown in Appendix A.1.

There  is  a  second  situation,  similar  in  nature,  in  which  we  will  need  to  count  instances  where  two 
solutions  differ  by  just  one  in  one  of  the  variables. 


Lemma 5.2 There is an algorithm, ALGORITHM D2, which counts the number of triples $(i, k, j) \in B_m^3$ such that there are simultaneous solutions $v, x \in B_m$ to:

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_2 + \Theta(k,v)}{4} \right\rfloor \equiv \left\lfloor \frac{\mu_2 + \Theta(k-1,x)}{4} \right\rfloor \pmod{2^p} \tag{48}$$

In addition, this algorithm will determine the number of solutions to equation (48) which satisfy $v < j < x$. The complexity of this algorithm is $O(p)$.


The  complete  proof  is  somewhat  lengthy.  It  can  be  found  in  Appendix  A.2. 

We are now ready to enumerate cache misses with an A-way associative cache using the strategy outlined above. We introduce two more pieces of notation to ease discussion. First, let $E_0$ and $E_1$ be the number of $\sigma_i$ with $i \ge p$ which are equal to 0 and 1 respectively. Let $E = E_0 + E_1$. Note that $E = \max\{2m - p, 0\}$. Second, when referring to a variable that occurs in one of the equations (43) to (46) we will use $\bar{r}$ to denote the digits in the variable $r$ that occur in the equation taken mod $2^p$, and $(p, q, r)_{2^p}$ to denote that all variables in the tuple should be taken mod $2^p$, i.e., the triple $(p, q, r)_{2^p}$ contains $\bar{p}$, $\bar{q}$ and $\bar{r}$. Note that $\bar{r} = r$ if $E = 0$. Let us first enumerate C-A collisions.
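As a small illustration of this notation, the counts $E_0$, $E_1$ and $E$ can be read directly off the interleaving. The sketch below assumes that the interleaving is given as a list `sigma` of $2m$ entries in {0, 1}, indexed from the least significant digit; the function name is hypothetical.

```python
def extension_counts(sigma, p):
    """Return (E0, E1, E) for the digits of sigma at positions >= p."""
    high = sigma[p:]                      # digits p .. 2m-1 (empty if 2m <= p)
    e0 = sum(1 for s in high if s == 0)   # high digits fed by the first index
    e1 = sum(1 for s in high if s == 1)   # high digits fed by the second index
    return e0, e1, e0 + e1
```

For example, with `sigma = [0, 1, 0, 1, 0, 1]` (so $m = 3$) and `p = 4`, the two digits at positions 4 and 5 give $E_0 = E_1 = 1$ and $E = 2 = \max\{2m - p, 0\}$.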

5.1  Enumeration  of  C-A  collisions 

Step 1: Using the methods from Sections 2-4, determine NS, the number of triples $(i, j, u)_{2^p}$ which satisfy equation (43). Using ALGORITHM D1 from Lemma 5.1, determine ND, the number of triples $(i, j, u)_{2^p}$ such that $(i, j, u)_{2^p}$ and $(i, j, u-1)_{2^p}$ simultaneously satisfy equation (43).

Step 2: There are $(NS - ND) \cdot 2^{E_0 + 2 E_1}$ iteration steps $(i, k, j)$ at which there has been a single C-A collision since the previous access of $C_{i,j}$. There are $ND \cdot 2^{E_0 + 2 E_1}$ iteration points $(i, k, j)$ at which there have been two C-A collisions since the previous access of $C_{i,j}$.
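The arithmetic of Step 2 can be written out as a short helper. This is a sketch under the assumption that NS and ND have already been obtained by the methods named in Step 1; the function name is hypothetical and the code does not compute NS or ND itself.

```python
def ca_iteration_points(ns, nd, e0, e1):
    """Split C-A iteration points by collision count, as in Step 2.

    Returns (points_with_one_collision, points_with_two_collisions).
    """
    scale = 2 ** (e0 + 2 * e1)   # number of extensions of each mod-2^p triple
    return (ns - nd) * scale, nd * scale
```

For instance, `ca_iteration_points(10, 3, 1, 2)` scales both counts by $2^{1 + 2 \cdot 2} = 32$.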

5.2  Enumeration  of  C-B  collisions 

This is the far more interesting case because elements of both B and C ($C[i, j]$ and $B[k, j]$) are less “well behaved” than the elements $A[i, k]$ of array A. We therefore present the approach in full.

Step 1: Using the methods from Sections 2-4, determine NS, the number of triples $T = (i, j, z)_{2^p}$ having the property that there is a $\bar{y}$ such that $(i, j, z, y)_{2^p}$ satisfies equation (44). To each such triple $T$ we attach a multiplicity $m[T]$, this being the number of such $\bar{y}$. Equation (44) almost completely determines $\bar{y}$; however, this multiplicity may arise if there is more than one choice of initial digits for $y$ which give the same carry in $\mu_2 + \Theta(z, y)$ from digits 0, 1 to digit 2. Using ALGORITHM D2 from Lemma 5.2, determine ND, the number of triples $(i, j, k)_{2^p}$ such that there are simultaneous solutions $(i, j, k, u)_{2^p}$ and $(i, j, k-1, x)_{2^p}$ to equation (44). Also, using ALGORITHM D2, determine ND1, the number of triples $(i, j, k)_{2^p}$ such that there are simultaneous solutions $(i, j, k, u)_{2^p}$ and $(i, j, k-1, x)_{2^p}$ to equation (44) with $u < j < x$. Again, we will attach a multiplicity to each of these solutions.

Step 2: The next step differs significantly depending on whether $E = 0$ or $E > 0$, i.e., whether $2m \le p$ or $2m > p$. We divide into those two cases, of which the latter is the more interesting.

Case 1: $E = 0$

In this case, our enumeration is straightforward. There are ND1 iteration points $(i, k, j)$ in which there are two collisions of the forms $T_1 = (i, k, j, u)$ and $T_2 = (i, k-1, j, x)$. Each of $T_1$ and $T_2$ must be counted according to its multiplicity. There are $NS - 2 \cdot ND1$ iteration points where there has been a unique collision (which must be counted with multiplicity).

Case 2: $E > 0$

In this case, the enumeration is more challenging. Consider a solution $T = (i, z, j)_{2^p}$ counted in the number NS. It is enumerated because there exists a $\bar{y}$ such that $(i, j, z, y)_{2^p}$ is a solution to equation (44). Let $T_{2^p} = (i, z-1, j)_{2^p}$. Assume first that $T_{2^p}$ is not also a solution to equation (44). Let $(i, k, j)$ be any triple of numbers that extends $T$. Any extension $y$ of $\bar{y}$ will count a collision that occurs when $B_{k,y}$ is accessed at iteration step $(i, k, y)$, so long as $y < j$. Let $\phi(j)$ and $\phi(y)$ denote the extensions of $\bar{j}$ and $\bar{y}$, i.e., the digits of $j$ and $y$ beyond those fixed mod $2^p$. If $\phi(y) < \phi(j)$, then $y < j$. If $\phi(y) > \phi(j)$, then $y > j$. If $\phi(y) = \phi(j)$, then $y < j$ iff $\bar{y} < \bar{j}$. So, $\phi(j)$ is an estimate for the number of collisions that is correct to within one.
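The "correct to within one" claim follows from comparing $y$ and $j$ by high part first and low part second. The sketch below checks this numerically. It assumes a hypothetical parameter `q` for the number of low-order digits of the column index that are fixed mod $2^p$, so that $\phi(j)$ is `j >> q`; neither `q` nor the function name appears in the text.

```python
def count_extensions_below(j, y_low, q, e1):
    """Count extensions y of a fixed low part y_low (q bits) with y < j.

    The high part phi ranges over all e1-bit values, as for an
    unrestricted extension of y-bar.
    """
    return sum(1 for phi in range(2 ** e1) if ((phi << q) | y_low) < j)


def phi(j, q):
    """High-order part of j: the digits beyond the q fixed low digits."""
    return j >> q
```

For every `j` and `y_low`, the count equals `phi(j, q)` when `y_low >= j % 2**q` and `phi(j, q) + 1` otherwise, which is the sense in which $\phi(j)$ estimates the number of collisions to within one.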

On the other hand, the extension of $y$ is arbitrary. So, for every solution $T = (i, z, j)_{2^p}$ to equation (44) counted by NS which is not counted by ND, and for every choice of $\phi \in \{0, 1, \ldots, 2^{E_1} - 1\}$, there are $2^{2 E_0}$ iteration steps $(i, k, j)$ for which the number of C-B collisions is $\phi$ times the multiplicity of the triple, and this estimate is correct to within the multiplicity.

Assume now that $T_{2^p}$ is also a solution to equation (44). Such pairs $(T, T_{2^p})$ are enumerated by ALGORITHM D2; their contribution to the analysis above must be subtracted out as a first step. In this case, every one of the $2^{2 E_0 + E_1}$ extensions $(i, k, j)$ of $T = (i, z, j)_{2^p}$ is an iteration step at which there have been $m[T] \cdot 2^{E_1}$ collisions between the access of $C_{i,j}$ at iteration step $(i, k-1, j+r)$ and the current access.

The reasoning is as follows. Let $(i, z, j, u)_{2^p}$ and $(i, z-1, j, x)_{2^p}$ be the simultaneous solutions to equation (48). There are $2^{2 E_0 + E_1}$ extensions of $(i, z, j)_{2^p}$ to a triple $(i, k, j)$. Let $\phi \in B_{E_1}$. If $\phi < \phi(j)$ then we assign $\phi$ to be an extension of $\bar{u}$ to a $u < j$ so that there is a collision when $B_{k,u}$ is accessed at iteration step $(i, k, u)$. If $\phi > \phi(j)$ then we assign $\phi$ to be an extension of $\bar{x}$ to an $x > j$ so that there is a collision when $B_{k-1,x}$ is accessed at iteration step $(i, k-1, x)$.

5.3  Enumeration  of  C-C  collisions 

Let NS be the number of solutions to equation (46), which we can compute using the methods in Sections 2-4. For every choice of solution $(i, j, y)_{2^p}$ to equation (46) there are $2^{E_0 + E_1}$ ways to extend $\bar{i}$, $\bar{j}$ to $i, j \in B_m$. For each such pair of extensions there are $2^m$ ways to choose a $k$ to complete the determination of the iteration step $(i, k, j)$. Then every one of the $2^{E_1}$ possible extensions of $\bar{y}$ to $y \in B_m$ indexes a collision between $C_{i,j}$ and $C_{i,y}$ that occurs between the access of $C_{i,j}$ at iteration step $(i, k-1, j+r)$ and the access at iteration step $(i, k, j)$. If $y < j$ then the collision occurs at iteration step $(i, k, y)$, whereas if $y > j$ then the collision occurs at iteration step $(i, k-1, y)$. There is one exception to this analysis. Clearly $\bar{y} = \bar{j}$ is a solution to equation (46) if and only if the choices of initial digits for $y$ and $j$ are identical. So, it is straightforward to compute R, the number of solutions to equation (46) in which $\bar{y} = \bar{j}$. The significance of these R solutions is that if $\bar{y} = \bar{j}$, then $y = j$ is a possible extension of $\bar{y}$, in which case the collision we count above is not genuine.

To summarize, there are $NS \cdot 2^{E_0 + E_1 + m}$ iteration points where there have been C-C collisions. Of these, in $R \cdot 2^{E_0 + E_1 + m}$ cases there have been $2^{E_1} - 1$ collisions and in $(NS - R) \cdot 2^{E_0 + E_1 + m}$ cases there have been $2^{E_1}$ collisions.
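The summary above amounts to a two-way split of the C-C iteration points. The following sketch tabulates it, assuming NS and R have already been computed as described; the function name and the shape of the return value are hypothetical.

```python
def cc_summary(ns, r, e0, e1, m):
    """Tabulate C-C iteration points by number of collisions.

    Returns a list of (number_of_iteration_points, collisions_per_point).
    """
    scale = 2 ** (e0 + e1 + m)          # extensions of (i, j) times choices of k
    return [
        (r * scale, 2 ** e1 - 1),       # y-bar == j-bar: the extension y == j is excluded
        ((ns - r) * scale, 2 ** e1),    # all extensions of y-bar are genuine collisions
    ]
```

Summing the first components recovers the total $NS \cdot 2^{E_0 + E_1 + m}$ stated in the text.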

The following chart summarizes the analysis of C-A, C-B, and C-C collisions above.

Type     # of iteration pts                                                                # of Collisions
C-A      $NS_{(43)} \cdot 2^{E_0 + 2 E_1}$                                                 1 or 2
C-B1     $(NS_{(44)} - 2 \cdot NS_{(48)}) \cdot 2^{2 E_0} \cdot (\#\phi < 2^{E_1})$        $\phi$ or $2^{E_1} - \phi$
C-B2     $NS_{(48)} \cdot 2^{2 E_0 + E_1}$                                                 $2^{E_1}$
C-C      $NS \cdot 2^{E_0 + E_1 + m}$                                                      $2^{E_1}$ or $(2^{E_1} - 1)$

It remains to enumerate iteration points that fall into more than one of those categories. To understand the need for this, assume for example that $2^{E_1} < A < 2 \cdot 2^{E_1}$. Then no iteration point would exhibit more than A collisions of a single type C-A, C-B or C-C. However, if there exists an iteration point that is simultaneously of type C-B2 and C-C, then there would be at least $2 \cdot 2^{E_1} > A$ collisions between the access of $C_{i,j}$ at $(i, k-1, j+r)$ and the access at $(i, k, j)$. So there would be a C miss at $(i, k, j)$ with an A-way associative cache.

The way we proceed is largely similar to what we have already shown in Section 4 for the direct-mapped case. The lengthy analysis shown in Appendix A.3 yields a method to enumerate iteration points by number of C collisions. Using this method, we can determine $\Phi(C, t)$, the number of iteration points $(i, k, j)$ for which there have been exactly $t$ collisions with $C_{i,j}$ between the prior access of $C_{i,j}$ and the access at $(i, k, j)$. Assuming an A-way associative cache with an LRU protocol, the number of C misses is $\sum_{t \ge A} \Phi(C, t)$.
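Once the collision histogram is available, the final summation is immediate. The sketch below assumes the histogram is supplied as a dictionary mapping each collision count $t$ to the number of iteration points with exactly $t$ collisions; this representation and the function name are hypothetical, and the code does not construct the histogram.

```python
def c_misses(histogram, assoc):
    """Sum the histogram entries with at least `assoc` collisions.

    With LRU replacement, an element is evicted once `assoc` distinct
    other elements have entered its cache set since its last access.
    """
    return sum(count for t, count in histogram.items() if t >= assoc)
```

For example, `c_misses({0: 10, 1: 5, 2: 3, 4: 1}, assoc=2)` adds the entries for $t = 2$ and $t = 4$.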

At  this  point  we  have  indicated  how  to  enumerate  A  misses  and  C  misses  in  the  case  of  an  A-way 
associative  cache.  It  remains  to  enumerate  B  misses.  Since  the  technical  difficulties  we  encounter,  as  well 
as  the  ideas  we  use  to  overcome  these  difficulties,  are  similar  to  those  seen  in  the  enumeration  of  C  misses 
we  leave  details  to  the  reader. 

The  extension  to  first  in  first  out  (FIFO)  replacement  is  straightforward.  Here,  the  requirement  that  the 
accessed  matrix  elements  are  different  is  dropped  in  the  definition  of  a  collision. 
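For small problem sizes, the enumeration can be cross-checked against a direct simulation, which plays the role of the naive algorithm. The sketch below is not the method of this paper. It assumes loop order $(i, k, j)$ with one access each to $A[i,k]$, $B[k,j]$ and $C[i,j]$ per iteration in that order, four elements per cache block, $2^p$ cache sets, and the convention that $\sigma_t = 0$ selects a bit of the first index. All names are hypothetical.

```python
from collections import OrderedDict


def theta(row, col, sigma):
    """Interleave bits: sigma[t] == 0 takes the next row bit, 1 the next col bit."""
    addr, r_pos, c_pos = 0, 0, 0
    for t, s in enumerate(sigma):
        if s == 0:
            bit = (row >> r_pos) & 1
            r_pos += 1
        else:
            bit = (col >> c_pos) & 1
            c_pos += 1
        addr |= bit << t
    return addr


def simulate(m, sigma, mus, p, assoc):
    """Count misses per array for C += A * B under set-associative LRU."""
    n = 1 << m
    sets = [OrderedDict() for _ in range(1 << p)]   # one LRU queue per cache set
    misses = {"A": 0, "B": 0, "C": 0}

    def access(name, mu, row, col):
        block = (mu + theta(row, col, sigma)) // 4
        lines = sets[block % (1 << p)]
        if block in lines:
            lines.move_to_end(block)        # hit: mark as most recently used
        else:
            misses[name] += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)   # evict the least recently used block
            lines[block] = True

    mu_a, mu_b, mu_c = mus
    for i in range(n):
        for k in range(n):
            for j in range(n):
                access("A", mu_a, i, k)
                access("B", mu_b, k, j)
                access("C", mu_c, i, j)
    return misses
```

Dropping the `move_to_end` call on a hit turns the replacement policy into FIFO, since blocks are then evicted purely in order of insertion.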

6  Conclusions 

This paper introduced a class of array layouts, interleavings, and efficient algorithms to exactly assess the number of cache misses caused by such layouts when used in the context of matrix multiplication. The layouts are described by bit-level address manipulations, and cache misses are counted by reasoning about the solutions to simple bit-level equations. Most importantly, we achieve a reduction in complexity from $O(2^m + p)$ to $O(\max(m, p))$ with respect to the naive algorithm by exploiting properties of carry propagation. Although there are various subcases in the analysis of cache misses, each case can ultimately be reduced to one of two combinatorial enumeration problems.

A particular strength of our technique is that it explicitly handles cross interference between arrays, which is generally considered to be difficult to handle. Also, our model allows an elegant extension to a set-associative cache with LRU replacement strategy.

Our current work has several limitations. First, we have thus far provided an analysis only of matrix multiplication, and for $2^m \times 2^m$ matrices at that. It seems likely that the ideas can be generalized to handle other computations, but this remains to be demonstrated. Second, a number of special cases arise in dealing with the least significant bits of $\sigma$ that are truncated when converting a memory address to a block address. Our restriction to a cache block size of four elements required us to handle only two bits of $\sigma$, but the problem could be more acute for larger block sizes (e.g., in analyzing TLB behavior). Finally, our use of inclusion-exclusion poses the imminent danger of combinatorial explosion when the interaction of many arrays has to be calculated. However, as mentioned in Section 2.3, this case can be adequately approximated as many of these intersections are empty or sparse.

Our immediate future work will tackle the optimization problem of determining layout functions that minimize the number of cache misses for matrix multiplication. There are also related problems (such as counting compulsory misses, differentiating capacity and compulsory misses, and identifying cache contents at the end of executing a loop nest) for which efficient algorithms remain to be found.

A  Proofs  and  Details  for  Section  5


A.1  Proof  of  Lemma  5.1

Proof: First, the algorithm will loop over the possible first two digits of $\Theta(i, j)$. As before in Section 3, we will let $\mu'_3 + \Theta'(i', j')$ denote the part of $\mu_3 + \Theta(i, j)$ in digits 2 to $(p-1)$, where the carry from digits 0, 1 is incorporated into $\mu'_3$.

Similarly, loop over the possible first two digits of $\Theta(i, z)$ (some of which might have already been fixed because $i$ is common to $\Theta(i, j)$ and $\Theta(i, z)$). When $\Theta(i, z+1)$ is computed from $\Theta(i, z)$ there will be some carry $\epsilon \in B_2$ from the part of $z+1$ that occurs in the first two digits of $\Theta(i, z+1)$ to the part of $z+1$ that occurs in digits 2 to $(p-1)$ of $\Theta(i, z+1)$. Note that $\epsilon$ is determined by the first two digits of $\Theta(i, z)$ that we are looping over. Lastly, let $\gamma_1$ and $\gamma_2$ be the carry from digit one to digit two in $\mu_1 + \Theta(i, z)$ and $\mu_1 + \Theta(i, z+1)$ respectively. We use the prime notation from Section 4 to denote digits 2 to $(p-1)$. Combining all this we have


The equalities in the equation above are modulo $2^{p-2}$, but still it is clear that the only possible ways in which they can be realized are if

1) $\gamma_1 = \gamma_2 = \epsilon = 0$

or

2) $\gamma_1 = \epsilon = 1$, $\gamma_2 = 0$ and $\sigma_2 = 1$.

So the algorithm proceeds in the following way. It loops over the possible first two digits of $\Theta(i, j)$ and $\Theta(i, z)$. For each such choice, the algorithm computes $\gamma_1$, $\gamma_2$ and $\epsilon$. If neither 1) nor 2) above is satisfied, then there is no contribution to the total count and the algorithm proceeds to the next step in the loop. If either 1) or 2) above is satisfied, then the algorithm computes the number of solutions to the equation:

$$\mu'_3 + \Theta'(i', j') = \mu'_1 + \Theta'(i', z') \tag{50}$$

using the AC ALGORITHM and adds that number to the total.
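A brute-force count of the triples in Lemma 5.1 is useful for validating an implementation of ALGORITHM D1 on small inputs. The sketch below is not ALGORITHM D1 and has exponential cost. It assumes that $\sigma_t = 0$ selects a bit of the first index, that the modulus is $2^p$ as in equation (47), and that $z$ is restricted so that $z + 1$ still fits in $m$ bits; all names are hypothetical.

```python
def theta(row, col, sigma):
    """Interleave bits: sigma[t] == 0 takes the next row bit, 1 the next col bit."""
    addr, r_pos, c_pos = 0, 0, 0
    for t, s in enumerate(sigma):
        if s == 0:
            bit = (row >> r_pos) & 1
            r_pos += 1
        else:
            bit = (col >> c_pos) & 1
            c_pos += 1
        addr |= bit << t
    return addr


def count_d1_bruteforce(m, sigma, mu1, mu3, p):
    """Count triples (i, z, j) for which both z and z + 1 satisfy equation (43)."""
    n = 1 << m
    mod = 1 << p

    def cache_set(mu, row, col):
        return ((mu + theta(row, col, sigma)) // 4) % mod

    count = 0
    for i in range(n):
        for j in range(n):
            target = cache_set(mu3, i, j)
            for z in range(n - 1):          # assumption: z + 1 must fit in m bits
                if cache_set(mu1, i, z) == target and cache_set(mu1, i, z + 1) == target:
                    count += 1
    return count
```

With `p = 0` every triple qualifies, so the count is $2^m \cdot 2^m \cdot (2^m - 1)$, which gives a quick sanity check of the loop bounds.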


A.2  Proof  of  Lemma  5.2 

Proof: We will need some terminology and notation to explain this algorithm. Let $l$ be the number of 0's in the set $\{\sigma_0, \sigma_1\}$. The initial digits of either $k$ or $k-1$ will refer to the first $l$, i.e., those that appear in the first two digits of $\Theta(k, v)$ and $\Theta(k-1, x)$. Let $\nu$ be the minimal index greater than 1 with $\sigma_\nu = 0$. So we have $\sigma_2 = \sigma_3 = \cdots = \sigma_{\nu-1} = 1$.

The first step in this algorithm is to loop over choices for the initial digits of $k-1$ (which will also determine the initial digits of $k$). Let $\tau \cdot 2^l$ be the carry when 1 is added to the initial digits of $k-1$. Note that $\tau$ is carried to the $\nu$th binary digit when $\Theta(k, v)$ is computed from $\Theta(k-1, v)$.

The next step is to loop over possible carries $\epsilon_0$, $\epsilon_1$ and $\epsilon_2$ from the zero and first binary digits to the second binary digit in $\mu_3 + \Theta(i, j)$, $\mu_2 + \Theta(k, v)$ and $\mu_2 + \Theta(k-1, x)$ respectively. We compute the number $N_0 \cdot N_1 \cdot N_2$, where $N_0$ is the number of choices for the zero and first binary digits of $\mu_3 + \Theta(i, j)$ which will result in a carry of $\epsilon_0$, and where $N_1$ and $N_2$ are the numbers of choices of initial digits of $v$ and $x$ respectively which will result in carries of $\epsilon_1$ and $\epsilon_2$ respectively (given the choices we have already made for initial digits in $k-1$ and $k$).

With this notation and the prime notation we can express equation (48) as:

$$\mu'_3 + \Theta'(i', j') + \epsilon_0 = \mu'_2 + \Theta'(k', v') + \epsilon_1 = \mu'_2 + \Theta'((k-1)', x') + \epsilon_2. \tag{51}$$

So,

$$\Theta'((k-1)', v') + \epsilon_1 + \tau \cdot 2^{\nu-2} = \Theta'((k-1)', x') + \epsilon_2. \tag{52}$$

Rewriting equation (52) we obtain

$$(\epsilon_1 - \epsilon_2) + \tau \cdot 2^{\nu-2} = \Theta'((k-1)', x') - \Theta'((k-1)', v'). \tag{53}$$

Let $\bar{v}$ and $\bar{x}$ denote $v'$ and $x'$ taken mod $2^{\nu-2}$. We note that the right-hand side of equation (53) is equal to

$$\bar{x} - \bar{v} + \mathrm{GLOB}$$

where GLOB is a multiple of $2^{\nu-1}$. Since the left-hand side of equation (53) is strictly less than $2^{\nu-1}$, we deduce that GLOB = 0 and so

$$\epsilon_1 - \epsilon_2 + \tau \cdot 2^{\nu-2} = \bar{x} - \bar{v}. \tag{54}$$


Case 1: $\tau = 0$.

In this case, equation (54) becomes

$$\epsilon_1 + \bar{v} = \epsilon_2 + \bar{x}. \tag{55}$$

Also, $\epsilon_1 + \bar{v} = \epsilon_2 + \bar{x}$ is completely determined by the equality in equation (51). So there is exactly one choice of $\bar{v}$ and $\bar{x}$ in this case for every $i$, $j$. So in this case there are $\min\{2^{p-2}, 2^{2m-2}\}$ choices for $i$, $j$. For each such choice there is exactly one choice of $k$. For this triple $(i, k, j)$ there are $N_0 \cdot N_1 \cdot N_2$ choices of $B_{w,z}$ that collide with $C_{i,j}$ between its access at iteration steps $(i, k-1, j+r)$ and $(i, k, j)$.

It remains to determine the number of these collisions satisfying $v < j < x$. The reader will note that

$$\epsilon_1 + \bar{v} = \epsilon_2 + \bar{x}$$

so there are no such $j$ unless $\epsilon_1 = 1$, $\epsilon_2 = 0$, and in this case any $j$ with $v < j < x$ must satisfy

$$\bar{v} = \bar{j'} < \bar{x} = \bar{v} + 1. \tag{56}$$

The first thing to check is whether the choices of initial digits for $j$, $x$, $v$ would mean that $j'$, $x'$, $v'$ which satisfy equation (56) would give $j$, $x$, $v$ with $v < j < x$. If not, then there are no such $j$. If so, we enumerate $j'$, $x'$, $v'$ by enumerating solutions to:

$$\mu'_3 + \Theta'(i', j') = \mu'_2 + \Theta'(k', j')$$

using the methods developed in Section 2.

Case 2: $\tau = 1$.

In this case equation (54) is equivalent to:

$$\bar{v} + \epsilon_1 = \bar{x} + \epsilon_2 + 2^{\nu}. \tag{57}$$

Recalling that $\bar{v}, \bar{x} \in B_\nu$, we see that equation (57) can have a solution only if $\bar{x} = 0$, $\epsilon_2 = 0$, $\epsilon_1 = 1$ and $\bar{v} = 2^{\nu} - 1$. In order for these choices of $\bar{x}$, $\bar{v}$, $\epsilon_1$ and $\epsilon_2$ to satisfy equation (51) we must have that $\mu'_3 + \Theta'(i', j') + \epsilon_0$ agrees with $\mu'_2$ in digits 0 to $(\nu - 3)$. This implies that digits 0 to $(\nu - 3)$ in $j'$ are determined by $\mu'_3$, $\epsilon_0$ and $\mu'_2$. So there are $\min\{2^{p-\nu}, 2^{2m-\nu}\}$ choices for such $i$, $j$. Each determines a unique $k$ and $N_0 \cdot N_1 \cdot N_2$ pairs $w$, $z$ such that $B_{w,z}$ collides with $C_{i,j}$ between the access of $C_{i,j}$ at iteration steps $(i, k-1, j+r)$ and $(i, k, j)$.

Lastly, we need to determine the number of these solutions which satisfy $v < j < x$. However, in this case, $v > x$ and so there are no such $j$.

This  completes  the  proof  of  Lemma  5.2  and  the  construction  of  ALGORITHM  D2. 


A.3  Enumeration  of  iteration  points  that  exhibit  simultaneous  collisions 


We show the case of the enumeration of iteration points that exhibit simultaneous C-A and C-B collisions and thus seek to enumerate iteration points $(i, k, j)$ such that there have been both C-A and C-B collisions between $(i, k-1, j+r)$ and $(i, k, j)$. We begin by enumerating simultaneous solutions to equation (43) and equation (44), but with the added factor that either $u = z$, $u + 1 = z$ or $u = z + 1$. We split our analysis into those three cases.

We will consider only the case $u = z$ here; the other cases can be handled via similar methods. In the case $u = z$, a simultaneous solution to equation (43) and equation (44) must satisfy:

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_1 + \Theta(i,z)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_2 + \Theta(z,y)_{2^p}}{4} \right\rfloor \tag{58}$$


We enumerate solutions to equation (58) by first counting solutions $\bar{j}$, $\bar{z}$ to the left-most equality using the AB Algorithm, but keeping the choice of $i$ open for the moment. For each such solution $\bar{j}$, $\bar{z}$, there is one and only one choice of $\bar{i}$ and $\bar{y}$ that satisfies the second equality.

If $E = 0$, then $i$, $j$, $z$ and $y$ are determined at this point. Also, $k$ is determined, as $k = z$ if $y < j$ and $k = z + 1$ if $y > j$. For this iteration point $(i, k, j)$, the number of collisions will be 2, 3 or 4. The normal case will be 2, but 3 or 4 may result if there are two C-A collisions or two C-B collisions (or both). We will discuss these cases below.

If $E > 0$ then we must extend $\bar{i}$, $\bar{j}$, $\bar{z}$ and $\bar{y}$ to get $i$, $j$, $z$ and $y$. The consideration on extensions is identical to those above. In particular, $k$ will be either $z$ or $z + 1$ depending on how the extension of $\bar{y}$ compares to the extension of $\bar{j}$. In cases where $\bar{z}$ and $\bar{z} + 1$ are not both solutions to equation (44), the number of collisions will be $1 + \phi(j)$ or $2 + \phi(j)$ for $k = z$, depending on whether there are multiple C-A collisions (which is discussed below). The number of collisions will be $1 + 2^{E_1} - \phi(j)$ or $2 + 2^{E_1} - \phi(j)$ for $k = z + 1$, depending on whether there are multiple C-A collisions. However, if $\bar{z}$ and $\bar{z} + 1$ are both solutions to equation (44), then the number of collisions will be either $1 + 2^{E_1}$ or $2 + 2^{E_1}$, depending on whether there are multiple C-A collisions.

So  we  will  need  to  enumerate  simultaneous  solutions  to  equation  (47)  and  equation  (44),  equation  (43) 
and  equation  (48),  equation  (47)  and  equation  (48).  These  enumeration  problems  can  be  solved  using  the 
tools  we  have  already  developed  and  applied  to  counting  solutions  to  equation  (43)  and  equation  (44).  So, 
we  will  omit  many  of  the  details  in  our  account  of  how  to  proceed. 

To enumerate simultaneous solutions to equation (47) and equation (44), we begin by counting solutions to

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_1 + \Theta(i,z)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_1 + \Theta(i,z+1)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_2 + \Theta(z,y)_{2^p}}{4} \right\rfloor \tag{59}$$

To count solutions to this equation, use Lemma 5.1 to enumerate solutions $\bar{j}$, $\bar{z}$ to the first two equalities, leaving $i$ undecided. With $\bar{j}$, $\bar{z}$ fixed there are unique choices for $\bar{i}$ and $\bar{y}$ which satisfy the third equality.

Each choice of $\bar{j}$, $\bar{z}$, $\bar{i}$, and $\bar{y}$ which satisfies equation (59) must now be extended. First, note that there are $2^{E_0 + 2 E_1}$ ways to extend each solution $\bar{j}$, $\bar{z}$, $\bar{i}$. Observe that the extension of $\bar{z}$ must be $k - 1$ because of the form of equation (59). For each such extension, there are $2 + Y$ collisions of types C-A and C-B at iteration point $(i, k, j)$, where $Y$ is the number of extensions of $\bar{y}$. Since $z = k - 1$, the number $Y$ of extensions of $\bar{y}$ is $2^{E_1} - \phi(j)$, where $\phi(j)$ is the extension of $\bar{j}$.

As noted in the previous paragraph, the form of equation (59) implies that $z = k - 1$. So we must also solve a second set of equations which differs from equation (59) only in that the last $\bar{z}$ is replaced by $\bar{z} + 1$. The enumeration of solutions to this set of equations follows the lines above, with the only major difference being that there are $\phi(j)$ extensions possible for $\bar{y}$. This impacts the number of collisions that have occurred at iteration step $(i, k, j)$.

To enumerate simultaneous solutions to equation (43) and equation (48), we must count solutions to the system of equations

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_1 + \Theta(i,z)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_2 + \Theta(z,y)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_2 + \Theta(z+1,x)_{2^p}}{4} \right\rfloor \tag{60}$$

To enumerate solutions to equation (60), we use the AB Algorithm to count solutions to the second equality. The reasoning that went into the proof of Lemma 5.2 gives (in each of two cases) the relationship between $\bar{y}$ and $\bar{x}$. The one complication that arises is in the case (from Lemma 5.2) where $\tau = 1$. In this
case,  the  first  v  digits  of  y'  are  determined  and  so  the  first  v  digits  of  will  also  be  determined.  This  gives 
a  partial  determination  of  T  which  must  be  factored  into  the  AB  Algorithm  as  was  done  in  Section  3. 

Each solution of equation (60) must be extended. The number of extensions is straightforward to count in this case, since there are $2^{E_1}$ extensions of the pair $\bar{y}$, $\bar{x}$.

To complete this analysis, we must enumerate simultaneous solutions to equation (47) and equation (48). We begin by enumerating solutions to

$$\left\lfloor \frac{\mu_3 + \Theta(i,j)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_1 + \Theta(i,z)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_1 + \Theta(i,z+1)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_2 + \Theta(z,y)_{2^p}}{4} \right\rfloor = \left\lfloor \frac{\mu_2 + \Theta(z+1,x)_{2^p}}{4} \right\rfloor \tag{61}$$

To characterize solutions to equation (61), begin with the first two equalities. Following the reasoning in the proof of Lemma 5.1, we can eliminate some choices for initial digits of $\bar{j}$, $\bar{z}$, $\bar{i}$, $\bar{x}$, and $\bar{y}$. For those that are not eliminated, solutions of equation (61) are equivalent to solutions of equation (60), and so we can use the methods developed above to enumerate those solutions. As in other cases, once solutions to equation (61) are enumerated, they must be extended to give simultaneous solutions to equation (47) and equation (48). These extensions will then give the number of iteration points. At each there will be $2 + 2^{E_1}$ collisions between the access of $C_{i,j}$ at that iteration step and the most recent previous access.

This  completes  the  enumeration  of  iteration  steps  where  there  have  been  both  and  C-A  and  C-B  colli¬ 
sions.  The  next  step  is  to  enumerate  iteration  points  where  there  have  been  both  C-A  and  C-C  collisions, 
iteration  points  where  there  have  been  both  C-B  and  C-C  collisions,  and  iteration  points  where  there  have 
been  all  three  of  C-A,  C-B  and  C-C  collisions.  To  shorten  this  exposition,  we  will  only  indicate  where  the 
modifications  of  the  previous  analyses  come  in. 
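For small matrices, an analytic enumeration of this kind can be cross-checked by brute force. The sketch below is an illustration under stated assumptions, not the method of this paper: it assumes a direct-mapped cache, a k, i, j loop order, a layout that interleaves the bits of the row and column indices, and it treats a collision as an intervening access that maps to the same cache set as C[i][j] but lies on a different memory line. The names `interleave`, `classify_collisions`, `bases`, `elem`, `line` and `sets` are hypothetical.

```python
from collections import Counter
from itertools import product


def interleave(i, j, m):
    """Interleave the m-bit indices i and j (bits of i in odd positions)."""
    out = 0
    for b in range(m):
        out |= ((i >> b) & 1) << (2 * b + 1)
        out |= ((j >> b) & 1) << (2 * b)
    return out


def classify_collisions(m, bases, elem=8, line=32, sets=16):
    """Tally, for each reuse of C[i][j], which arrays collided with it."""
    n = 1 << m
    trace = []   # (array name, memory line number) for every access so far
    last = {}    # (i, j) -> position in trace of the previous access to C[i][j]
    tally = Counter()
    for k, i, j in product(range(n), repeat=3):   # assumed loop order: k, i, j
        for name, r, c in (("A", i, k), ("B", k, j), ("C", i, j)):
            ln = (bases[name] + interleave(r, c, m) * elem) // line
            if name == "C":
                if (i, j) in last:
                    between = trace[last[(i, j)] + 1:]
                    # Same cache set, different memory line: a collision.
                    kinds = {nm for nm, other in between
                             if other != ln and other % sets == ln % sets}
                    tally["+".join(sorted(kinds)) or "none"] += 1
                last[(i, j)] = len(trace)
            trace.append((name, ln))
    return tally


# Three 4x4 arrays packed back to back fit in the 512-byte cache: no collisions.
packed = classify_collisions(2, {"A": 0, "B": 128, "C": 256})
# Bases that are multiples of the cache size make all three arrays alias.
aliased = classify_collisions(2, {"A": 0, "B": 512, "C": 1024})
```

The keys of the returned tally ("A", "A+B", "A+B+C", and so on) correspond to the combined cases distinguished in the text, so the per-key totals are the quantities an analytic count would need to reproduce under the same assumptions.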

Enumerating cases where there have been both C-A and C-C collisions is straightforward. Enumerate
solutions j, u of equation (43), including the number for which both j, u and j, u - 1 are solutions. For
each of these, there is at most one η such that j, η satisfy equation (46). There will exist an η unless the
carries from the initial two digits of the two sides of equation (46) are different. In this case one may have to
eliminate one possible solution to equation (43). This determination can be made by an O(p) examination
of μ1 and μ3. Once j, u and η have been determined, we can choose i arbitrarily and we can arbitrarily
extend the partial values of j, u and η to full values of j, u and η. Different extensions of i, j and u lead to
different iteration points. However, for fixed i, j and u, different extensions of η correspond to multiple collisions.
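The distinction between iteration points and multiple collisions at one point can be illustrated with a small grouping step. This is a sketch with made-up input: the tuples below are hypothetical and only show the bookkeeping, in which the first three coordinates identify the iteration point and the fourth coordinate distinguishes collisions at that point.

```python
from collections import defaultdict


def group_by_iteration_point(solutions):
    """Map each iteration point (i, j, u) to its number of collisions."""
    per_point = defaultdict(int)
    for i, j, u, _fourth in solutions:
        per_point[(i, j, u)] += 1   # same point, one more collision
    return dict(per_point)


# Hypothetical extended solutions, for illustration only.
sols = [(0, 1, 2, 0), (0, 1, 2, 1), (3, 1, 2, 0)]
points = group_by_iteration_point(sols)
```

Here the number of keys is the number of iteration points, and each value is the collision multiplicity at that point.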

Enumerating cases where there have been both C-B and C-C collisions is likewise straightforward.
First enumerate solutions to equation (46). For every solution j, v, there is a uniquely determined solution
z, η, up to a multiplicity m[T] that might arise from differing initial digits in Θ(z, η). As before, we
must determine whether different choices of initial digits might lead to both z and z + 1 being solutions.
This is straightforward. Each solution j, v, z, η just enumerated must be extended. The issues related to
extensions are identical to the issues that arose in the enumeration of C-B collisions. We leave details to the
reader.

Lastly, we need to enumerate cases where there have been C-A, C-B and C-C collisions. We first
enumerate iteration points where there have been C-A and C-C collisions as above. For every such solution,
we can solve for z, y uniquely up to choice of initial digits. As usual, consideration must be given
to whether u and u - 1 are both solutions to equation (43) and to whether z and z + 1 are both solutions
to equation (44). No novel issues arise around extensions, and so again we leave details to the
reader.